# AntiAtropos Architecture Guide
A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.
---
## The Big Picture
AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.
The system is split across two platforms:
```
Hugging Face Spaces AWS
===================== ======================
The "brain" The "muscle"
AntiAtropos FastAPI server EKS (Kubernetes cluster)
- Runs the simulator - Runs the actual microservice pods
- Runs the SRE agent logic - The agent scales these pods
- Queries Prometheus for metrics - Prometheus Agent scrapes metrics
- Sends scale commands to K8s - Metrics flow to AMP
- Grafana (AMG) visualizes it all
```
Why split? HF Spaces is a free or low-cost place to run the Python server, while AWS EKS hosts the real infrastructure that the agent practices on.
---
## Kubernetes Concepts You Need
### Pod
The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").
Think of it as: one running instance of a service.
### Deployment
A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.
The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down.
```
Deployment: payments
replicas: 3 <-- the agent changes this number
|
+-- Pod: payments-abc123 (running)
+-- Pod: payments-def456 (running)
+-- Pod: payments-ghi789 (running)
```
**The agent scales replicas, not pods.** When it sets `replicas: 5`, Kubernetes creates pods until 5 are running. When it then sets `replicas: 2`, Kubernetes terminates 3 of them.
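The relationship between the replica count and the pods can be sketched as a toy reconciliation loop. This is only an illustration of the idea, not how the Kubernetes Deployment controller is implemented: `reconcile` and the generated pod names are hypothetical, and real Kubernetes chooses which pods to terminate by its own rules.

```python
import itertools

# Hypothetical name generator; real pod names get a ReplicaSet hash and random suffix.
_suffix = itertools.count(1)


def reconcile(deployment: str, pods: list[str], replicas: int) -> list[str]:
    """Return the pod list after a toy controller converges on `replicas`."""
    pods = list(pods)
    while len(pods) < replicas:      # too few pods: create more
        pods.append(f"{deployment}-{next(_suffix):05d}")
    del pods[replicas:]              # too many pods: terminate the surplus
    return pods


running = reconcile("payments", [], 5)       # replicas: 5 -> five pods running
running = reconcile("payments", running, 2)  # replicas: 2 -> three pods terminated
print(running)
```

The agent only ever supplies the number; creating and removing pods is left to the cluster.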
### Service
A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy.
### Namespace
A namespace is a folder for organizing resources. We use:
- `prod-sre` — where the 5 microservice Deployments live
- `monitoring` — where the Prometheus Agent pod lives
- `kube-system` — where AWS/EKS system pods live
### Node
A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes, and each node runs multiple pods. When all nodes are full and the agent wants to scale up, the Cluster Autoscaler adds more nodes (up to `maxSize: 4` in our config).
```
EKS Cluster
Node 1 (t3.medium - 2 vCPU, 4GB RAM)
Pod: payments-abc123
Pod: checkout-def456
Pod: catalog-ghi789
Pod: prometheus-agent-xyz
Node 2 (t3.medium - 2 vCPU, 4GB RAM)
Pod: payments-jkl012 <-- agent scaled payments from 1 to 2
Pod: cart-mno345
Pod: auth-pqr678
```
### ResourceQuota
A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.
---
## How the SRE Agent Works
### The Loop
Every "tick" (one step of the simulation), the agent goes through this cycle:
```
1. OBSERVE -- Read telemetry (CPU, latency, queue depth) from Prometheus
2. DECIDE -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
3. ACT -- Send the action to KubernetesExecutor
4. REWARD -- Compute Lyapunov stability reward (was the cluster more or less stable?)
5. REPEAT
```
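The cycle above can be sketched with a toy stand-in for the environment. Everything here is hypothetical: `ToyCluster`, the threshold policy in `decide`, and the reward are placeholders chosen to make the loop runnable, and the reward is not the Lyapunov reward the project uses.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical stand-in for the environment: one service, one metric."""
    replicas: int = 2
    load: float = 2.0  # offered load, in units of one replica's capacity

    def observe(self) -> dict:
        return {"cpu": min(1.0, self.load / self.replicas)}

    def act(self, action: str) -> None:
        if action == "SCALE_UP":
            self.replicas += 1
        elif action == "SCALE_DOWN":
            self.replicas = max(1, self.replicas - 1)
        # REROUTE_TRAFFIC, SHED_LOAD and NO_OP do nothing in this toy


def decide(obs: dict) -> str:
    """Placeholder threshold policy; a trained agent would replace this."""
    if obs["cpu"] > 0.8:
        return "SCALE_UP"
    if obs["cpu"] < 0.3:
        return "SCALE_DOWN"
    return "NO_OP"


def run(cluster: ToyCluster, ticks: int) -> list[tuple[str, float]]:
    history = []
    for _ in range(ticks):
        obs = cluster.observe()                         # 1. OBSERVE
        action = decide(obs)                            # 2. DECIDE
        cluster.act(action)                             # 3. ACT
        reward = -abs(cluster.observe()["cpu"] - 0.6)   # 4. REWARD (toy)
        history.append((action, reward))                # 5. REPEAT
    return history


print(run(ToyCluster(), ticks=3))
```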
### How Each Action Works
| Action | What the Agent Decides | What Happens on EKS |
|---|---|---|
| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments |
| `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` |
| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |
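As a rough sketch of the `SHED_LOAD` row, the function below removes `parameter * current_replicas` replicas from the target deployment. The table does not say how the product is rounded or what the lower bound is, so truncating with `int()` and keeping at least one replica are assumptions, and `shed_load_replicas` is a hypothetical name.

```python
def shed_load_replicas(current: int, parameter: float, min_replicas: int = 1) -> int:
    """Replica count after SHED_LOAD, under the assumptions stated above."""
    removed = int(parameter * current)          # assumption: truncate toward zero
    return max(min_replicas, current - removed) # assumption: never drop below 1


# "Drop 50% of traffic to node-3" with 4 cart replicas running.
print(shed_load_replicas(current=4, parameter=0.5))
```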
### The SCALE_UP Flow in Detail
Here is exactly what happens when the agent decides to scale up `node-0` (the payments service):
```
HF Spaces AWS EKS
---------- --------
Agent: "SCALE_UP, node-0, parameter=0.5"
|
v
AntiAtroposEnvironment.step()
|
v
KubernetesExecutor.execute_with_metadata()
|
v
_load_node_workload_map()
reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
|
v
_scale_deployment("SCALE_UP", "node-0", 0.5)
|
+-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
| Current replicas = 2
|
+-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
| Desired = min(6, 2 + 1) = 3 <-- max_replicas cap from env var
|
+-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
| body={"spec": {"replicas": 3}})
|
v +---------------------------+
Returns: "Ack: SCALE_UP for node-0 - | K8s creates 1 new pod: |
deployment payments in namespace | payments-newpod-xyz |
prod-sre scaled 2->3" +---------------------------+
```
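The arithmetic in steps 2 and 3 can be written as a small pure function. This is a sketch that mirrors the worked example, not the project's `_scale_deployment()` itself: `plan_scale` is a hypothetical name, `step_base=3` is taken from the `int(0.5 * 3)` term in the example without knowing what it represents, and treating `SCALE_DOWN` as the mirror image of `SCALE_UP` is an assumption.

```python
def plan_scale(action: str, current: int, parameter: float,
               min_replicas: int = 1, max_replicas: int = 6,
               step_base: int = 3) -> tuple[int, dict]:
    """Return (desired replicas, patch body) for a scale action."""
    delta = max(1, int(parameter * step_base))   # always move by at least 1
    if action == "SCALE_UP":
        desired = min(max_replicas, current + delta)
    elif action == "SCALE_DOWN":                 # assumption: symmetric to SCALE_UP
        desired = max(min_replicas, current - delta)
    else:
        raise ValueError(f"not a scaling action: {action}")
    # Body in the shape shown in step 3, as passed to
    # patch_namespaced_deployment_scale(name, namespace, body=...).
    return desired, {"spec": {"replicas": desired}}


desired, body = plan_scale("SCALE_UP", current=2, parameter=0.5)
print(desired, body)
```

Keeping the calculation separate from the API call makes the cap easy to unit-test without a cluster.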
### The Telemetry Flow in Detail
How the agent reads metrics from the real cluster:
```
EKS Cluster AMP HF Spaces
----------- --- ----------
Workload pods AMP Workspace AntiAtropos
(payments, checkout...) stores all metrics PrometheusClient
| ^ |
| /metrics (scraped every 15s) | |
v | |
Prometheus Agent | |
| | |
| remote-write (SigV4 auth) | |
+-------------------------------------------> |
| |
| HTTPS query |
+------------------------>
(PROMETHEUS_URL env var)
|
v
_fetch_real_metrics()
runs PromQL like:
sum(rate(http_requests_total[1m])) by (pod)
returns: TelemetryRecord for each node
```
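The query step can be sketched with the standard library alone. The code below builds an instant-query URL for the Prometheus HTTP API and parses a response in the documented vector shape. It does not send a request: AMP requires SigV4-signed requests, which this sketch leaves out. `build_query_url`, `parse_vector`, the example base URL, and the sample payload are all hypothetical and are not taken from `PrometheusClient`.

```python
import json
from urllib.parse import urlencode

QUERY = "sum(rate(http_requests_total[1m])) by (pod)"


def build_query_url(base_url: str, promql: str) -> str:
    """Instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: str) -> dict[str, float]:
    """Map pod name -> sample value from an instant-query JSON response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    return {
        sample["metric"].get("pod", "<none>"): float(sample["value"][1])
        for sample in body["data"]["result"]
    }


# Hand-written sample in the response shape (not real cluster data).
SAMPLE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"pod": "payments-abc123"}, "value": [1700000000, "12.5"]},
    ]},
})

print(build_query_url("https://example.invalid/workspaces/ws-demo", QUERY))
print(parse_vector(SAMPLE))
```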
---
## The Three Layers of Scaling Caps
This is the most important thing to understand for cost control. There are **three** independent limits:
### Layer 1: Python Code Cap (Soft)
**Where:** `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18.
**How it works:** The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets:
```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```
**Can it be bypassed?** Yes, either by a bug in the code or by someone running `kubectl scale deployment payments --replicas=50` directly against the cluster.
**Set to:** `6` on HF Spaces.
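A minimal sketch of how such a cap could be read and applied is shown below. It is not the code in `kubernetes_executor.py`: the function names, the fallback of 6 when the variable is missing or invalid, and the shortened wording of the "scaled" message are assumptions. Only the "unchanged" message follows the format quoted above.

```python
import os
from typing import Mapping, Optional


def max_replicas_from_env(env: Optional[Mapping[str, str]] = None,
                          fallback: int = 6) -> int:
    """Read ANTIATROPOS_MAX_REPLICAS, using `fallback` if unset or invalid."""
    env = os.environ if env is None else env
    try:
        value = int(env.get("ANTIATROPOS_MAX_REPLICAS", ""))
    except ValueError:
        return fallback
    return value if value >= 1 else fallback


def scale_up_ack(node: str, current: int, delta: int,
                 max_replicas: int, min_replicas: int = 1) -> str:
    """Apply the soft cap and describe the outcome."""
    desired = min(max_replicas, current + delta)
    if desired == current:
        return (f"Ack: SCALE_UP for {node} - replicas unchanged at {current} "
                f"(bounds {min_replicas}-{max_replicas})")
    return f"Ack: SCALE_UP for {node} - scaled {current}->{desired}"  # abbreviated


cap = max_replicas_from_env({"ANTIATROPOS_MAX_REPLICAS": "6"})
print(scale_up_ack("node-0", current=6, delta=1, max_replicas=cap))
```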
### Layer 2: Kubernetes ResourceQuota (Hard)
**Where:** `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace.
**How it works:** Kubernetes itself refuses to create pods that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:
```
Error from server (Forbidden): pods "payments-new" is forbidden:
exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
```
**Can it be bypassed?** Only by someone with cluster-admin access who deletes or edits the ResourceQuota.
**Set to:** 30 pods total, 8 CPU, 8GB RAM.
### Layer 3: EKS Node Group Max Size (Hard)
**Where:** `eksctl-cluster.yaml` — `managedNodeGroups[0].maxSize: 4`.
**How it works:** The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.
**Can it be bypassed?** Only by someone who changes the node group's maximum size, for example in the AWS console.
**Set to:** 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).
### How the Three Layers Work Together
```
Agent wants to scale all 5 deployments to 20 replicas each:
Layer 1 (Python cap): 6 replicas max per deployment -> agent gets "unchanged at 6"
5 x 6 = 30 pods maximum
Layer 2 (ResourceQuota): 30 pods max in namespace -> 31st pod is Forbidden
Layer 3 (Node group): 4 nodes max -> if 30 pods don't fit on 4 nodes,
some stay Pending (no cost)
Worst case with all caps: 30 pods on 4 nodes = ~$160/month
Without any caps: 100 pods on 25 nodes = ~$1,800/month
```
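The pod arithmetic for the first two layers can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the numbers from this guide. `worst_case_pods` is a hypothetical helper, and it ignores any pods in the namespace that do not belong to the five deployments.

```python
def worst_case_pods(requested_per_deployment: int, deployments: int = 5,
                    max_replicas: int = 6, quota_pods: int = 30) -> tuple[int, int]:
    """Return (pods allowed by Layer 1, pods allowed after Layer 2)."""
    after_layer1 = deployments * min(requested_per_deployment, max_replicas)
    after_layer2 = min(after_layer1, quota_pods)
    return after_layer1, after_layer2


# Agent asks for 20 replicas on each of the 5 deployments.
print(worst_case_pods(20))
# Same request with the Python cap bypassed (for example, a bug raises it to 50).
print(worst_case_pods(20, max_replicas=50))
```

The second call shows why the layers are independent: even when the Python cap fails, the ResourceQuota still holds the namespace at 30 pods.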
---
## The Mapping: Simulator Nodes to Real Deployments
The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to:
```
Simulator Node K8s Deployment Namespace Notes
------------- --------------- --------- -----
node-0 payments prod-sre VIP (4x importance weight)
node-1 checkout prod-sre Critical (no SHED_LOAD)
node-2 catalog prod-sre Critical (no SHED_LOAD)
node-3 cart prod-sre Non-critical (sheddable)
node-4 auth prod-sre Non-critical (sheddable)
```
When the simulator says "SCALE_UP node-0 by 0.5", the system:
1. Looks up node-0 in the workload map -> `payments` in `prod-sre`
2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)`
3. Kubernetes creates/destroys pods to match the new replica count
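Step 1 of the lookup can be sketched as follows. The guide does not state the format of `ANTIATROPOS_WORKLOAD_MAP`, so the JSON encoding here is an assumption. The per-node entry shape (`deployment`, `namespace`) follows the SCALE_UP flow diagram, and `resolve_workload` is a hypothetical name.

```python
import json

# Assumed encoding: a JSON object keyed by simulator node ID.
EXAMPLE_MAP = json.dumps({
    "node-0": {"deployment": "payments", "namespace": "prod-sre"},
    "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
    "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
    "node-3": {"deployment": "cart", "namespace": "prod-sre"},
    "node-4": {"deployment": "auth", "namespace": "prod-sre"},
})


def resolve_workload(node_id: str, raw_map: str) -> tuple[str, str]:
    """Return (deployment, namespace) for a simulator node."""
    workloads = json.loads(raw_map)
    if node_id not in workloads:
        raise KeyError(f"no workload mapped for {node_id}")
    entry = workloads[node_id]
    return entry["deployment"], entry["namespace"]


print(resolve_workload("node-0", EXAMPLE_MAP))
```

In a deployment the string would come from `os.environ["ANTIATROPOS_WORKLOAD_MAP"]` rather than a constant.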
---
## What Runs Where (Complete List)
### On Hugging Face Spaces
| Component | What It Does | Port |
|---|---|---|
| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |
### On AWS EKS
| Component | Namespace | What It Does |
|---|---|---|
| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |
### On AWS Managed Services
| Service | What It Does |
|---|---|
| AMP (Amazon Managed Service for Prometheus) | Stores all metrics. Queried by HF Spaces. |
| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |
---
## The Simulator vs Real Cluster
AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`:
### Simulated Mode (`simulated`)
Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.
This is the default on HF Spaces without AWS configured.
### Hybrid Mode (`hybrid`)
The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.
### Live Mode (`live`)
The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money.
**Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.**
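The difference between the three modes comes down to two questions: does the agent read real metrics, and do its actions touch real pods? The sketch below encodes that. `EnvMode` and the helper functions are hypothetical names, and defaulting to `simulated` when the variable is unset is an assumption based on the description of simulated mode above.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class EnvMode(str, Enum):
    SIMULATED = "simulated"
    HYBRID = "hybrid"
    LIVE = "live"


def read_env_mode(env: Optional[Mapping[str, str]] = None) -> EnvMode:
    """Parse ANTIATROPOS_ENV_MODE; raises ValueError on an unknown value."""
    env = os.environ if env is None else env
    raw = env.get("ANTIATROPOS_ENV_MODE", "simulated")  # assumed default
    return EnvMode(raw.strip().lower())


def reads_real_metrics(mode: EnvMode) -> bool:
    return mode in (EnvMode.HYBRID, EnvMode.LIVE)


def touches_real_pods(mode: EnvMode) -> bool:
    return mode is EnvMode.LIVE


mode = read_env_mode({"ANTIATROPOS_ENV_MODE": "hybrid"})
print(mode.value, reads_real_metrics(mode), touches_real_pods(mode))
```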
---
## Cost Flow
Every pod on EKS costs money. Here is how costs flow based on the agent's actions:
```
Agent action: SCALE_UP node-0
-> payments Deployment: replicas 2 -> 5
-> 3 new pods created
-> If existing nodes are full, Cluster Autoscaler adds a node
-> New node = another t3.medium EC2 instance = ~$0.04/hr
-> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota
Agent action: SCALE_DOWN node-3
-> cart Deployment: replicas 4 -> 1
-> 3 pods terminated
-> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
-> One fewer EC2 instance = saves ~$0.04/hr
```
The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:
```
R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)
^^^^
beta=0.01 penalizes over-provisioning
```
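The formula translates directly into code. Only `beta=0.01` comes from this guide. The defaults for `alpha` and `gamma` below are placeholders, and how `stability.py` computes `delta_V`, `cost`, and `SLA_violation` is not shown here.

```python
def reward(delta_v: float, cost: float, sla_violation: float,
           alpha: float = 1.0, beta: float = 0.01, gamma: float = 1.0) -> float:
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation).

    alpha and gamma defaults are placeholders, not the project's values.
    """
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)


# Same stability change, different cost: the cheaper step scores higher.
print(reward(delta_v=0.5, cost=10.0, sla_violation=0.0))
print(reward(delta_v=0.5, cost=30.0, sla_violation=0.0))
```

A negative `delta_v` (the cluster became more stable) yields a positive contribution, so the agent is rewarded for reducing instability and penalized for every unit of cost it adds.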
---
## Quick Reference: Key Files
| File | Purpose |
|---|---|
| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
| `prometheus_client.py` | Queries AMP for real metrics |
| `simulator.py` | 5-node fluid-queue simulation |
| `stability.py` | Lyapunov reward computation |
| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |