File size: 14,900 Bytes
8cd4141
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9db539d
 
8cd4141
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
# AntiAtropos Architecture Guide

A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.

---

## The Big Picture

AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.

The system is split across two platforms:

```
Hugging Face Spaces                      AWS
=====================                    ======================
The "brain"                              The "muscle"

AntiAtropos FastAPI server               EKS (Kubernetes cluster)
  - Runs the simulator                     - Runs the actual microservice pods
  - Runs the SRE agent logic               - The agent scales these pods
  - Queries Prometheus for metrics         - Prometheus Agent scrapes metrics
  - Sends scale commands to K8s            - Metrics flow to AMP
                                           - Grafana (AMG) visualizes it all
```

Why split? HF Spaces is free/cheap for running the Python server. AWS EKS is where the real infrastructure lives that the agent practices on.

---

## Kubernetes Concepts You Need

### Pod

The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").

Think of it as: one running instance of a service.

### Deployment

A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.

The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down.

```
Deployment: payments
  replicas: 3         <-- the agent changes this number
    |
    +-- Pod: payments-abc123   (running)
    +-- Pod: payments-def456   (running)
    +-- Pod: payments-ghi789   (running)
```

**The agent scales replicas, not pods.** When it sets `replicas: 5`, Kubernetes creates 5 pods. When it sets `replicas: 2`, Kubernetes kills 3 pods.

### Service

A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy.

### Namespace

A namespace is a folder for organizing resources. We use:
- `prod-sre` — where the 5 microservice Deployments live
- `monitoring` — where the Prometheus Agent pod lives
- `kube-system` — where AWS/EKS system pods live

### Node

A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes. Each node runs multiple pods. When all nodes are full and the agent wants to scale up, Kubernetes adds more nodes (up to `maxSize: 4` in our config).

```
EKS Cluster
  Node 1 (t3.medium - 4 vCPU, 8GB RAM)
    Pod: payments-abc123
    Pod: checkout-def456
    Pod: catalog-ghi789
    Pod: prometheus-agent-xyz
  Node 2 (t3.medium - 4 vCPU, 8GB RAM)
    Pod: payments-jkl012    <-- agent scaled payments from 1 to 2
    Pod: cart-mno345
    Pod: auth-pqr678
```

### ResourceQuota

A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.

---

## How the SRE Agent Works

### The Loop

Every "tick" (one step of the simulation), the agent goes through this cycle:

```
1. OBSERVE  -- Read telemetry (CPU, latency, queue depth) from Prometheus
2. DECIDE   -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
3. ACT      -- Send the action to KubernetesExecutor
4. REWARD   -- Compute Lyapunov stability reward (was the cluster more or less stable?)
5. REPEAT
```

### How Each Action Works

| Action | What the Agent Decides | What Happens on EKS |
|---|---|---|
| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments |
| `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` |
| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |

### The SCALE_UP Flow in Detail

Here is exactly what happens when the agent decides to scale up `node-0` (the payments service):

```
HF Spaces                                    AWS EKS
----------                                   --------

Agent: "SCALE_UP, node-0, parameter=0.5"
  |
  v
AntiAtroposEnvironment.step()
  |
  v
KubernetesExecutor.execute_with_metadata()
  |
  v
_load_node_workload_map()
  reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
  |
  v
_scale_deployment("SCALE_UP", "node-0", 0.5)
  |
  +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
  |      Current replicas = 2
  |
  +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
  |      Desired = min(6, 2 + 1) = 3        <-- max_replicas cap from env var
  |
  +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
  |         body={"spec": {"replicas": 3}})
  |
  v                                                     +---------------------------+
Returns: "Ack: SCALE_UP for node-0 -                    | K8s creates 1 new pod:    |
  deployment payments in namespace                       |   payments-newpod-xyz     |
  prod-sre scaled 2->3"                                  +---------------------------+
```

### The Telemetry Flow in Detail

How the agent reads metrics from the real cluster:

```
EKS Cluster                              AMP                          HF Spaces
-----------                              ---                          ----------

Workload pods                            AMP Workspace                AntiAtropos
(payments, checkout...)                  stores all metrics           PrometheusClient
  |                                           ^                        |
  | /metrics (scraped every 15s)              |                        |
  v                                           |                        |
Prometheus Agent                             |                        |
  |                                           |                        |
  | remote-write (SigV4 auth)                 |                        |
  +------------------------------------------->                        |
                                              |                        |
                                              |  HTTPS query           |
                                              +------------------------>
                                              (PROMETHEUS_URL env var)
                                                                       |
                                                                       v
                                                                 _fetch_real_metrics()
                                                                 runs PromQL like:
                                                                   sum(rate(http_requests_total[1m])) by (pod)
                                                                 returns: TelemetryRecord for each node
```

---

## The Three Layers of Scaling Caps

This is the most important thing to understand for cost control. There are **three** independent limits:

### Layer 1: Python Code Cap (Soft)

**Where:** `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18.

**How it works:** The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets:

```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```

**Can it be bypassed?** Yes. A bug in the code, or someone running `kubectl scale deployment payments --replicas=50` directly.

**Set to:** `6` on HF Spaces.

### Layer 2: Kubernetes ResourceQuota (Hard)

**Where:** `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace.

**How it works:** Kubernetes itself refuses to schedule pods that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:

```
Error from server (Forbidden): pods "payments-new" is forbidden:
exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
```

**Can it be bypassed?** Only by someone with cluster-admin access who deletes or edits the ResourceQuota.

**Set to:** 30 pods total, 8 CPU, 8GB RAM.

### Layer 3: EKS Node Group Max Size (Hard)

**Where:** `eksctl-cluster.yaml` — `managedNodeGroups[0].maxSize: 4`.

**How it works:** The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.

**Can it be bypassed?** Only by someone editing the node group in the AWS console.

**Set to:** 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).

### How the Three Layers Work Together

```
Agent wants to scale all 5 deployments to 20 replicas each:

Layer 1 (Python cap):      6 replicas max per deployment  -> agent gets "unchanged at 6"
                           5 x 6 = 30 pods maximum

Layer 2 (ResourceQuota):   30 pods max in namespace       -> 31st pod is Forbidden

Layer 3 (Node group):      4 nodes max                     -> if 30 pods don't fit on 4 nodes,
                                                            some stay Pending (no cost)

Worst case with all caps:  30 pods on 4 nodes = ~$160/month
Without any caps:          100 pods on 25 nodes = ~$1,800/month
```

---

## The Mapping: Simulator Nodes to Real Deployments

The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to:

```
Simulator Node    K8s Deployment    Namespace    Notes
-------------     ---------------   ---------    -----
node-0            payments          prod-sre     VIP (4x importance weight)
node-1            checkout          prod-sre     Critical (no SHED_LOAD)
node-2            catalog           prod-sre     Critical (no SHED_LOAD)
node-3            cart              prod-sre     Non-critical (sheddable)
node-4            auth              prod-sre     Non-critical (sheddable)
```

When the simulator says "SCALE_UP node-0 by 0.5", the system:
1. Looks up node-0 in the workload map -> `payments` in `prod-sre`
2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)`
3. Kubernetes creates/destroys pods to match the new replica count

---

## What Runs Where (Complete List)

### On Hugging Face Spaces

| Component | What It Does | Port |
|---|---|---|
| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |

### On AWS EKS

| Component | Namespace | What It Does |
|---|---|---|
| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |

### On AWS Managed Services

| Service | What It Does |
|---|---|
| AMP (Amazon Managed Prometheus) | Stores all metrics. Queried by HF Spaces. |
| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |

---

## The Simulator vs Real Cluster

AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`:

### Simulated Mode (`simulated`)

Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.

This is the default on HF Spaces without AWS configured.

### Hybrid Mode (`hybrid`)

The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.

### Live Mode (`live`)

The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money.

**Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.**

---

## Cost Flow

Every pod on EKS costs money. Here is how costs flow based on the agent's actions:

```
Agent action: SCALE_UP node-0
  -> payments Deployment: replicas 2 -> 5
  -> 3 new pods created
  -> If existing nodes are full, Cluster Autoscaler adds a node
  -> New node = another t3.medium EC2 instance = ~$0.04/hr
  -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota

Agent action: SCALE_DOWN node-3
  -> cart Deployment: replicas 4 -> 1
  -> 3 pods terminated
  -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
  -> One fewer EC2 instance = saves ~$0.04/hr
```

The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:

```
R_t = -(alpha * delta_V  +  beta * cost  +  gamma * SLA_violation)
                                  ^^^^
                           beta=0.01 penalizes over-provisioning
```

---

## Quick Reference: Key Files

| File | Purpose |
|---|---|
| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
| `prometheus_client.py` | Queries AMP for real metrics |
| `simulator.py` | 5-node fluid-queue simulation |
| `stability.py` | Lyapunov reward computation |
| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |