# AntiAtropos Architecture Guide
A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.
---
## The Big Picture
AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.
The system is split across two platforms:
```
Hugging Face Spaces                     AWS
=====================                   ======================
The "brain"                             The "muscle"
AntiAtropos FastAPI server              EKS (Kubernetes cluster)
  - Runs the simulator                    - Runs the actual microservice pods
  - Runs the SRE agent logic              - The agent scales these pods
  - Queries Prometheus for metrics        - Prometheus Agent scrapes metrics
  - Sends scale commands to K8s           - Metrics flow to AMP
                                          - Grafana (AMG) visualizes it all
```
Why split? HF Spaces is a free or cheap place to run the Python server. AWS EKS hosts the real infrastructure that the agent practices on.
---
## Kubernetes Concepts You Need
### Pod
The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").
Think of it as: one running instance of a service.
### Deployment
A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.
The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down.
```
Deployment: payments
  replicas: 3          <-- the agent changes this number
    |
    +-- Pod: payments-abc123 (running)
    +-- Pod: payments-def456 (running)
    +-- Pod: payments-ghi789 (running)
```
**The agent scales replicas, not pods.** When it sets `replicas: 5`, Kubernetes creates pods until 5 are running. When it then sets `replicas: 2`, Kubernetes terminates 3 of them.
### Service
A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy.
### Namespace
A namespace is a folder for organizing resources. We use:
- `prod-sre` — where the 5 microservice Deployments live
- `monitoring` — where the Prometheus Agent pod lives
- `kube-system` — where AWS/EKS system pods live
### Node
A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes, and each node runs multiple pods. When all nodes are full and the agent wants to scale up, the Cluster Autoscaler adds more nodes (up to `maxSize: 4` in our config).
```
EKS Cluster
  Node 1 (t3.medium - 2 vCPU, 4GB RAM)
    Pod: payments-abc123
    Pod: checkout-def456
    Pod: catalog-ghi789
    Pod: prometheus-agent-xyz
  Node 2 (t3.medium - 2 vCPU, 4GB RAM)
    Pod: payments-jkl012    <-- agent scaled payments from 1 to 2
    Pod: cart-mno345
    Pod: auth-pqr678
```
### ResourceQuota
A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.
---
## How the SRE Agent Works
### The Loop
Every "tick" (one step of the simulation), the agent goes through this cycle:
```
1. OBSERVE -- Read telemetry (CPU, latency, queue depth) from Prometheus
2. DECIDE -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
3. ACT -- Send the action to KubernetesExecutor
4. REWARD -- Compute Lyapunov stability reward (was the cluster more or less stable?)
5. REPEAT
```
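The cycle above can be sketched as a small Python loop. This is a minimal illustration rather than the project's code: `run_episode` and its four callables are hypothetical stand-ins for the Prometheus query, the agent policy, `KubernetesExecutor`, and the Lyapunov reward.

```python
from typing import Callable, Dict, List, Tuple

Telemetry = Dict[str, float]      # e.g. {"cpu": 0.8, "latency_ms": 120.0}
Action = Tuple[str, str, float]   # (action_type, target_node, parameter)


def run_episode(
    observe: Callable[[], Telemetry],
    decide: Callable[[Telemetry], Action],
    act: Callable[[Action], str],
    reward: Callable[[Telemetry, Telemetry], float],
    ticks: int,
) -> List[float]:
    """Run `ticks` iterations of observe -> decide -> act -> reward."""
    rewards: List[float] = []
    before = observe()                          # 1. OBSERVE
    for _ in range(ticks):
        action = decide(before)                 # 2. DECIDE
        act(action)                             # 3. ACT
        after = observe()                       # telemetry after the action
        rewards.append(reward(before, after))   # 4. REWARD
        before = after                          # 5. REPEAT
    return rewards
```

Passing the four steps in as callables keeps the loop identical whether the telemetry comes from the simulator or from a real cluster.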
### How Each Action Works
| Action | What the Agent Decides | What Happens on EKS |
|---|---|---|
| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments |
| `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` |
| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |
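
As a rough illustration of the `SHED_LOAD` row, the replica arithmetic might look like the sketch below. The function name `shed_load_replicas`, the truncating rounding, and the floor of 1 replica are assumptions made for this example, not confirmed behavior of `KubernetesExecutor`.

```python
def shed_load_replicas(current: int, parameter: float, min_replicas: int = 1) -> int:
    """Return the replica count after shedding `parameter` of the current replicas.

    Assumptions (not taken from the source code): the drop is truncated
    toward zero, and the result never falls below `min_replicas`.
    """
    if not 0.0 <= parameter <= 1.0:
        raise ValueError("parameter must be in [0, 1]")
    drop = int(parameter * current)          # e.g. 0.5 * 4 -> 2 replicas removed
    return max(min_replicas, current - drop)
```

With these assumptions, shedding 50% from a 4-replica deployment leaves 2 replicas, and a 1-replica deployment stays at 1.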
### The SCALE_UP Flow in Detail
Here is exactly what happens when the agent decides to scale up `node-0` (the payments service):
```
HF Spaces                                    AWS EKS
----------                                   --------
Agent: "SCALE_UP, node-0, parameter=0.5"
    |
    v
AntiAtroposEnvironment.step()
    |
    v
KubernetesExecutor.execute_with_metadata()
    |
    v
_load_node_workload_map()
  reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
    |
    v
_scale_deployment("SCALE_UP", "node-0", 0.5)
    |
    +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
    |      Current replicas = 2
    |
    +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
    |      Desired = min(6, 2 + 1) = 3    <-- max_replicas cap from env var
    |
    +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
    |        body={"spec": {"replicas": 3}})
    |
    v                                        +---------------------------+
Returns: "Ack: SCALE_UP for node-0 -         | K8s creates 1 new pod:    |
  deployment payments in namespace           |   payments-newpod-xyz     |
  prod-sre scaled 2->3"                      +---------------------------+
```
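The arithmetic in step 2 of the flow can be reproduced with a small pure function. This is a sketch under stated assumptions: the name `desired_replicas_after_scale_up` is hypothetical, and `step=3` is only a guess at what the `3` in `int(0.5 * 3)` represents.

```python
def desired_replicas_after_scale_up(
    current: int,
    parameter: float,
    step: int = 3,          # assumption: the "3" multiplied by the parameter in step 2
    max_replicas: int = 6,  # value of ANTIATROPOS_MAX_REPLICAS on HF Spaces
) -> int:
    """Return the replica count to patch for a SCALE_UP action."""
    delta = max(1, int(parameter * step))        # always add at least one replica
    return min(max_replicas, current + delta)    # never exceed the Python cap
```

For the example in the diagram, `desired_replicas_after_scale_up(2, 0.5)` gives 3, and any request made at 6 replicas stays at 6.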
### The Telemetry Flow in Detail
How the agent reads metrics from the real cluster:
```
EKS Cluster                           AMP                         HF Spaces
-----------                           ---                         ----------
Workload pods                         AMP Workspace               AntiAtropos
(payments, checkout...)               stores all metrics          PrometheusClient
    |                                     ^                           |
    | /metrics (scraped every 15s)        |                           |
    v                                     |                           |
Prometheus Agent                          |                           |
    |                                     |                           |
    | remote-write (SigV4 auth)           |                           |
    +------------------------------------->                           |
                                          |                           |
                                          |        HTTPS query        |
                                          +--------------------------->
                                                          (PROMETHEUS_URL env var)
                                                                      |
                                                                      v
                                                            _fetch_real_metrics()
                                                              runs PromQL like:
                                                                sum(rate(http_requests_total[1m])) by (pod)
                                                              returns: TelemetryRecord for each node
```
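The query step at the end of this flow can be illustrated with the standard Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The sketch below only builds the URL and parses a response body. The helper names `build_query_url` and `rates_by_pod` are hypothetical, the base URL is assumed not to already include `/api/v1/query`, and request signing plus the HTTP call itself are left out.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a Prometheus-compatible endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def rates_by_pod(response_body: str) -> Dict[str, float]:
    """Map pod name -> sample value from an instant-vector JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        # Each series carries its labels in "metric" and [timestamp, "value"] in "value".
        series["metric"].get("pod", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }
```

Keeping URL construction and response parsing separate from the network call makes both easy to check against a canned response.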
---
## The Three Layers of Scaling Caps
This is the most important thing to understand for cost control. There are **three** independent limits:
### Layer 1: Python Code Cap (Soft)
**Where:** `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18.
**How it works:** The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets:
```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```
**Can it be bypassed?** Yes, by a bug in the code or by someone running `kubectl scale deployment payments --replicas=50` directly.
**Set to:** `6` on HF Spaces.
### Layer 2: Kubernetes ResourceQuota (Hard)
**Where:** `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace.
**How it works:** Kubernetes itself rejects the creation of any pod that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:
```
Error from server (Forbidden): pods "payments-new" is forbidden:
exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
```
**Can it be bypassed?** Only by someone with cluster-admin access who deletes or edits the ResourceQuota.
**Set to:** 30 pods total, 8 CPU, 8GB RAM.
### Layer 3: EKS Node Group Max Size (Hard)
**Where:** `eksctl-cluster.yaml`, under `managedNodeGroups[0].maxSize: 4`.
**How it works:** The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.
**Can it be bypassed?** Only by someone changing the node group's maximum size, for example in the AWS console.
**Set to:** 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).
### How the Three Layers Work Together
```
Agent wants to scale all 5 deployments to 20 replicas each:
  Layer 1 (Python cap):      6 replicas max per deployment -> agent gets "unchanged at 6"
                             5 x 6 = 30 pods maximum
  Layer 2 (ResourceQuota):   30 pods max in namespace -> 31st pod is Forbidden
  Layer 3 (Node group):      4 nodes max -> if 30 pods don't fit on 4 nodes,
                             some stay Pending (no cost)
Worst case with all caps:    30 pods on 4 nodes = ~$160/month
Without any caps:            100 pods on 25 nodes = ~$1,800/month
```
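The interaction of the first two layers can be modeled in a few lines. This is a simplified sketch, not how Kubernetes admission works internally: the function `admitted_pods` is hypothetical, it grants quota in dictionary order, and it does not model Layer 3 (node capacity).

```python
from typing import Dict


def admitted_pods(
    requested: Dict[str, int],
    max_replicas: int = 6,   # Layer 1: per-deployment Python cap
    quota_pods: int = 30,    # Layer 2: namespace ResourceQuota on pods
) -> Dict[str, int]:
    """Apply the per-deployment cap, then the namespace-wide pod quota."""
    admitted: Dict[str, int] = {}
    remaining = quota_pods
    for name, want in requested.items():
        capped = min(want, max_replicas)    # Layer 1 clamps each deployment
        granted = min(capped, remaining)    # Layer 2 clamps the namespace total
        admitted[name] = granted
        remaining -= granted
    return admitted
```

Requesting 20 replicas for each of the 5 deployments yields 6 per deployment, 30 pods in total, which matches the worst case described above.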
---
## The Mapping: Simulator Nodes to Real Deployments
The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to:
```
Simulator Node    K8s Deployment    Namespace    Notes
--------------    --------------    ---------    -----
node-0            payments          prod-sre     VIP (4x importance weight)
node-1            checkout          prod-sre     Critical (no SHED_LOAD)
node-2            catalog           prod-sre     Critical (no SHED_LOAD)
node-3            cart              prod-sre     Non-critical (sheddable)
node-4            auth              prod-sre     Non-critical (sheddable)
```
When the simulator says "SCALE_UP node-0 by 0.5", the system:
1. Looks up node-0 in the workload map -> `payments` in `prod-sre`
2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)`
3. Kubernetes creates/destroys pods to match the new replica count
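The lookup in step 1 might be implemented along these lines. The sketch assumes that `ANTIATROPOS_WORKLOAD_MAP` holds a JSON object shaped like the `node-0 -> {"deployment": ..., "namespace": ...}` entry shown earlier. The function name `load_workload_map` and the fallback to the `default` namespace are assumptions made for illustration.

```python
import json
import os
from typing import Dict, Tuple


def load_workload_map(env_var: str = "ANTIATROPOS_WORKLOAD_MAP") -> Dict[str, Tuple[str, str]]:
    """Parse the env var into {simulator_node: (deployment, namespace)}."""
    raw = os.environ.get(env_var, "{}")   # assumption: an unset var means an empty map
    parsed = json.loads(raw)
    return {
        node: (entry["deployment"], entry.get("namespace", "default"))
        for node, entry in parsed.items()
    }
```

A caller would then resolve `node-0` to `("payments", "prod-sre")` before issuing the scale patch.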
---
## What Runs Where (Complete List)
### On Hugging Face Spaces
| Component | What It Does | Port |
|---|---|---|
| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |
### On AWS EKS
| Component | Namespace | What It Does |
|---|---|---|
| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |
### On AWS Managed Services
| Service | What It Does |
|---|---|
| AMP (Amazon Managed Service for Prometheus) | Stores all metrics. Queried by HF Spaces. |
| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |
---
## The Simulator vs Real Cluster
AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`:
### Simulated Mode (`simulated`)
Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.
This is the default on HF Spaces without AWS configured.
### Hybrid Mode (`hybrid`)
The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.
### Live Mode (`live`)
The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money.
**Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.**
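One way to express the three modes in code is an enum plus two predicates. This is a sketch, not the project's implementation: `EnvMode`, `read_env_mode`, `reads_real_metrics`, and `sends_real_scale_commands` are hypothetical names, and treating an unset variable as `simulated` is an assumption based on the default described above.

```python
import os
from enum import Enum


class EnvMode(str, Enum):
    SIMULATED = "simulated"
    HYBRID = "hybrid"
    LIVE = "live"


def read_env_mode() -> EnvMode:
    """Read ANTIATROPOS_ENV_MODE; raises ValueError on an unknown value."""
    raw = os.environ.get("ANTIATROPOS_ENV_MODE", "simulated").strip().lower()
    return EnvMode(raw)


def reads_real_metrics(mode: EnvMode) -> bool:
    # Hybrid and live both pull metrics from AMP.
    return mode in (EnvMode.HYBRID, EnvMode.LIVE)


def sends_real_scale_commands(mode: EnvMode) -> bool:
    # Only live mode touches real pods on EKS.
    return mode is EnvMode.LIVE
```

Failing loudly on an unrecognized value is a deliberate choice here, since a typo that silently fell back to another mode could either hide real data or incur real cost.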
---
## Cost Flow
Every pod on EKS costs money. Here is how costs flow based on the agent's actions:
```
Agent action: SCALE_UP node-0
  -> payments Deployment: replicas 2 -> 5
  -> 3 new pods created
  -> If existing nodes are full, Cluster Autoscaler adds a node
  -> New node = another t3.medium EC2 instance = ~$0.04/hr
  -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota
Agent action: SCALE_DOWN node-3
  -> cart Deployment: replicas 4 -> 1
  -> 3 pods terminated
  -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
  -> One fewer EC2 instance = saves ~$0.04/hr
```
The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:
```
R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)
                          ^^^^
                          beta=0.01 penalizes over-provisioning
```
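The formula translates directly into a small function. In this sketch only `beta=0.01` comes from the text above. The defaults `alpha=1.0` and `gamma=1.0` are placeholders chosen for illustration, and `lyapunov_reward` is a hypothetical name rather than the function in `stability.py`.

```python
def lyapunov_reward(
    delta_v: float,
    cost: float,
    sla_violation: float,
    alpha: float = 1.0,   # placeholder weight, not from the source
    beta: float = 0.01,   # cost weight stated in the guide
    gamma: float = 1.0,   # placeholder weight, not from the source
) -> float:
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)."""
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)
```

A step that lowers the Lyapunov function (negative `delta_v`) earns a higher reward than one that raises it, and every unit of cost subtracts `beta` from the reward.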
---
## Quick Reference: Key Files
| File | Purpose |
|---|---|
| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
| `prometheus_client.py` | Queries AMP for real metrics |
| `simulator.py` | 5-node fluid-queue simulation |
| `stability.py` | Lyapunov reward computation |
| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |