# AntiAtropos Architecture Guide

A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.

---

## The Big Picture

AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.

The system is split across two platforms:

```
Hugging Face Spaces                      AWS
=====================                    ======================
The "brain"                              The "muscle"

AntiAtropos FastAPI server               EKS (Kubernetes cluster)
  - Runs the simulator                     - Runs the actual microservice pods
  - Runs the SRE agent logic               - The agent scales these pods
  - Queries Prometheus for metrics         - Prometheus Agent scrapes metrics
  - Sends scale commands to K8s            - Metrics flow to AMP
                                           - Grafana (AMG) visualizes it all
```

Why split? HF Spaces is a free or low-cost place to run the Python server, while AWS EKS hosts the real infrastructure that the agent practices on.

---

## Kubernetes Concepts You Need

### Pod

The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").

Think of it as: one running instance of a service.

### Deployment

A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.

The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down.

```
Deployment: payments
  replicas: 3         <-- the agent changes this number
    |
    +-- Pod: payments-abc123   (running)
    +-- Pod: payments-def456   (running)
    +-- Pod: payments-ghi789   (running)
```

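**The agent scales replicas, not pods.** It changes only the replica count, and Kubernetes creates or terminates pods to match. Starting from 3 replicas, setting `replicas: 5` makes Kubernetes create 2 more pods; setting `replicas: 2` after that makes it terminate 3.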
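
To make the replica arithmetic concrete, here is a toy model of reconciliation in plain Python. It only illustrates the idea of converging on a desired count; it is not how the Kubernetes Deployment controller is implemented, and `reconcile` is a hypothetical name.

```python
def reconcile(running_pods: int, desired_replicas: int) -> tuple[int, int]:
    """Return (pods_to_create, pods_to_terminate) needed to reach the desired count."""
    diff = desired_replicas - running_pods
    return (max(diff, 0), max(-diff, 0))


# Starting from 3 running pods, then dropping from 5 to 2:
print(reconcile(3, 5))  # (2, 0): two pods to create
print(reconcile(5, 2))  # (0, 3): three pods to terminate
```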

### Service

A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy.

### Namespace

A namespace is a folder for organizing resources. We use:
- `prod-sre` — where the 5 microservice Deployments live
- `monitoring` — where the Prometheus Agent pod lives
- `kube-system` — where AWS/EKS system pods live

### Node

A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes, and each node runs multiple pods. When all nodes are full and the agent wants to scale up, the Cluster Autoscaler adds more nodes (up to `maxSize: 4` in our config).

```
EKS Cluster
  Node 1 (t3.medium - 2 vCPU, 4GB RAM)
    Pod: payments-abc123
    Pod: checkout-def456
    Pod: catalog-ghi789
    Pod: prometheus-agent-xyz
  Node 2 (t3.medium - 2 vCPU, 4GB RAM)
    Pod: payments-jkl012    <-- agent scaled payments from 1 to 2
    Pod: cart-mno345
    Pod: auth-pqr678
```

### ResourceQuota

A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.

---

## How the SRE Agent Works

### The Loop

Every "tick" (one step of the simulation), the agent goes through this cycle:

```
1. OBSERVE  -- Read telemetry (CPU, latency, queue depth) from Prometheus
2. DECIDE   -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
3. ACT      -- Send the action to KubernetesExecutor
4. REWARD   -- Compute Lyapunov stability reward (was the cluster more or less stable?)
5. REPEAT
```
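
The cycle above can be sketched as a small Python loop. Everything here is a stand-in: `observe` returns synthetic readings instead of querying Prometheus, `decide` is a hypothetical threshold policy rather than a trained agent, `act` only returns an acknowledgement string, and `reward` is a placeholder rather than the Lyapunov reward.

```python
import random
from dataclasses import dataclass

ACTIONS = ["SCALE_UP", "SCALE_DOWN", "REROUTE_TRAFFIC", "SHED_LOAD", "NO_OP"]


@dataclass
class Telemetry:
    cpu: float          # fraction of CPU in use, 0.0-1.0
    latency_ms: float
    queue_depth: int


def observe(rng: random.Random) -> Telemetry:
    # Stand-in for a Prometheus query: returns synthetic readings.
    return Telemetry(rng.random(), rng.uniform(10, 500), rng.randint(0, 100))


def decide(t: Telemetry) -> str:
    # Hypothetical threshold policy; a trained agent would replace this.
    if t.cpu > 0.8:
        return "SCALE_UP"
    if t.cpu < 0.2:
        return "SCALE_DOWN"
    return "NO_OP"


def act(action: str) -> str:
    # Stand-in for KubernetesExecutor: only reports what would happen.
    return f"Ack: {action}"


def reward(before: Telemetry, after: Telemetry) -> float:
    # Placeholder reward: positive when the queue got shorter.
    return float(before.queue_depth - after.queue_depth)


def run(ticks: int, seed: int = 0) -> list[tuple[str, float]]:
    rng = random.Random(seed)
    history = []
    state = observe(rng)                      # 1. OBSERVE
    for _ in range(ticks):
        action = decide(state)                # 2. DECIDE
        act(action)                           # 3. ACT
        next_state = observe(rng)
        history.append((action, reward(state, next_state)))  # 4. REWARD
        state = next_state                    # 5. REPEAT
    return history
```

Calling `run(5)` returns five `(action, reward)` pairs, one per tick.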

### How Each Action Works

| Action | What the Agent Decides | What Happens on EKS |
|---|---|---|
| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments |
| `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` |
| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |
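
As an illustration of the `SHED_LOAD` row, the sketch below removes `parameter * current_replicas` replicas. Rounding down and the floor of 1 replica are assumptions made for this example, and `shed_load_replicas` is a hypothetical name; the real executor may round or clamp differently.

```python
def shed_load_replicas(current: int, parameter: float, min_replicas: int = 1) -> int:
    """Replica count after SHED_LOAD, never dropping below min_replicas."""
    removed = int(parameter * current)  # assumption: fractional replicas round down
    return max(min_replicas, current - removed)


print(shed_load_replicas(4, 0.5))  # 2: half of 4 replicas removed
print(shed_load_replicas(1, 0.5))  # 1: nothing to remove, stays at the floor
```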

### The SCALE_UP Flow in Detail

Here is exactly what happens when the agent decides to scale up `node-0` (the payments service):

```
HF Spaces                                    AWS EKS
----------                                   --------

Agent: "SCALE_UP, node-0, parameter=0.5"
  |
  v
AntiAtroposEnvironment.step()
  |
  v
KubernetesExecutor.execute_with_metadata()
  |
  v
_load_node_workload_map()
  reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
  |
  v
_scale_deployment("SCALE_UP", "node-0", 0.5)
  |
  +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
  |      Current replicas = 2
  |
  +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
  |      Desired = min(6, 2 + 1) = 3        <-- max_replicas cap from env var
  |
  +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
  |         body={"spec": {"replicas": 3}})
  |
  v                                                     +---------------------------+
Returns: "Ack: SCALE_UP for node-0 -                    | K8s creates 1 new pod:    |
  deployment payments in namespace                       |   payments-newpod-xyz     |
  prod-sre scaled 2->3"                                  +---------------------------+
```
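
Step 2 of the flow is plain arithmetic, sketched here in Python. The `3` in the walkthrough is treated as a fixed step size and given the hypothetical name `step_base`; the default cap of 6 mirrors the `ANTIATROPOS_MAX_REPLICAS` value mentioned in this guide.

```python
def desired_replicas(current: int, parameter: float,
                     max_replicas: int = 6, step_base: int = 3) -> int:
    """Replica count after a SCALE_UP, clamped to max_replicas."""
    # step_base is an assumption: the walkthrough multiplies the parameter by 3.
    delta = max(1, int(parameter * step_base))
    return min(max_replicas, current + delta)


print(desired_replicas(2, 0.5))  # 3, matching the walkthrough (2 -> 3)
print(desired_replicas(6, 0.5))  # 6, already at the cap
```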

### The Telemetry Flow in Detail

How the agent reads metrics from the real cluster:

```
EKS Cluster                              AMP                          HF Spaces
-----------                              ---                          ----------

Workload pods                            AMP Workspace                AntiAtropos
(payments, checkout...)                  stores all metrics           PrometheusClient
  |                                           ^                        |
  | /metrics (scraped every 15s)              |                        |
  v                                           |                        |
Prometheus Agent                             |                        |
  |                                           |                        |
  | remote-write (SigV4 auth)                 |                        |
  +------------------------------------------->                        |
                                              |                        |
                                              |  HTTPS query           |
                                              +------------------------>
                                              (PROMETHEUS_URL env var)
                                                                       |
                                                                       v
                                                                 _fetch_real_metrics()
                                                                 runs PromQL like:
                                                                   sum(rate(http_requests_total[1m])) by (pod)
                                                                 returns: TelemetryRecord for each node
```
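
The query step can be sketched with the standard Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The function names below are hypothetical, the base URL is a placeholder, and a real request to AMP would also need AWS SigV4 signing, which this example leaves out.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str) -> dict[str, float]:
    """Map each pod label to its sample value from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return {
        sample["metric"].get("pod", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# A hand-written response in the shape Prometheus returns for a vector result.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"pod": "payments-abc123"}, "value": [1700000000, "12.5"]}],
    },
})
print(parse_vector(sample_body))  # {'payments-abc123': 12.5}
```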

---

## The Three Layers of Scaling Caps

This is the most important thing to understand for cost control. There are **three** independent limits:

### Layer 1: Python Code Cap (Soft)

**Where:** `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18.

**How it works:** The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets:

```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```

**Can it be bypassed?** Yes, either by a bug in the code or by someone running `kubectl scale deployment payments --replicas=50` directly.

**Set to:** `6` on HF Spaces.

### Layer 2: Kubernetes ResourceQuota (Hard)

**Where:** `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace.

**How it works:** Kubernetes itself rejects the creation of any pod that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:

```
Error from server (Forbidden): pods "payments-new" is forbidden:
exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
```

**Can it be bypassed?** Only by someone with cluster-admin access who deletes or edits the ResourceQuota.

**Set to:** 30 pods total, 8 CPU, 8GB RAM.
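
The quota check amounts to "used plus requested must not exceed the limit" in every dimension. The sketch below models that rule in Python using the limits stated above and the per-pod request of 0.1 CPU and 64MB that appears in the Cost Flow section; the `Usage` type and `admits` function are hypothetical, not Kubernetes code.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    pods: int
    cpu: float        # CPU cores
    memory_mb: int


# Limits as stated for prod-sre: 30 pods, 8 CPU, 8GB RAM.
QUOTA = Usage(pods=30, cpu=8.0, memory_mb=8 * 1024)


def admits(used: Usage, request: Usage, quota: Usage = QUOTA) -> bool:
    """True if the request fits under every quota dimension."""
    return (
        used.pods + request.pods <= quota.pods
        and used.cpu + request.cpu <= quota.cpu
        and used.memory_mb + request.memory_mb <= quota.memory_mb
    )


one_pod = Usage(pods=1, cpu=0.1, memory_mb=64)
print(admits(Usage(29, 2.9, 29 * 64), one_pod))  # True: the 30th pod fits
print(admits(Usage(30, 3.0, 30 * 64), one_pod))  # False: the 31st pod is rejected
```

With these numbers the pod count is the binding limit: 30 small pods use only about 3 CPU and under 2GB of the quota.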

### Layer 3: EKS Node Group Max Size (Hard)

**Where:** `eksctl-cluster.yaml` — `managedNodeGroups[0].maxSize: 4`.

**How it works:** The Cluster Autoscaler will never grow the node group beyond 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.

**Can it be bypassed?** Only by someone with AWS access who raises the node group's maximum size, for example in the AWS console.

**Set to:** 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).

### How the Three Layers Work Together

```
Agent wants to scale all 5 deployments to 20 replicas each:

Layer 1 (Python cap):      6 replicas max per deployment  -> agent gets "unchanged at 6"
                           5 x 6 = 30 pods maximum

Layer 2 (ResourceQuota):   30 pods max in namespace       -> 31st pod is Forbidden

Layer 3 (Node group):      4 nodes max                     -> if 30 pods don't fit on 4 nodes,
                                                            some stay Pending (no cost)

Worst case with all caps:  30 pods on 4 nodes = ~$160/month
Without any caps:          100 pods on 25 nodes = ~$1,800/month
```
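
The same walk through the layers can be written as a short calculation. The defaults mirror the values in this guide, except `pods_per_node`, which is a hypothetical figure chosen only to show Layer 3 taking effect; real node capacity depends on pod resource requests and instance limits.

```python
def effective_pods(requested_per_deployment: int, deployments: int = 5,
                   max_replicas: int = 6, quota_pods: int = 30,
                   max_nodes: int = 4, pods_per_node: int = 10) -> dict[str, int]:
    """Apply the three caps in order and report created, running, and pending pods."""
    # Layer 1: the Python cap clamps each deployment.
    after_code_cap = min(requested_per_deployment, max_replicas) * deployments
    # Layer 2: the ResourceQuota caps pods created in the namespace.
    created = min(after_code_cap, quota_pods)
    # Layer 3: the node group cap limits how many created pods can run.
    running = min(created, max_nodes * pods_per_node)
    return {"created": created, "running": running, "pending": created - running}


print(effective_pods(20))                   # {'created': 30, 'running': 30, 'pending': 0}
print(effective_pods(20, pods_per_node=5))  # {'created': 30, 'running': 20, 'pending': 10}
print(effective_pods(20, max_replicas=20))  # Layer 1 bypassed: quota still holds at 30
```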

---

## The Mapping: Simulator Nodes to Real Deployments

The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to:

```
Simulator Node    K8s Deployment    Namespace    Notes
-------------     ---------------   ---------    -----
node-0            payments          prod-sre     VIP (4x importance weight)
node-1            checkout          prod-sre     Critical (no SHED_LOAD)
node-2            catalog           prod-sre     Critical (no SHED_LOAD)
node-3            cart              prod-sre     Non-critical (sheddable)
node-4            auth              prod-sre     Non-critical (sheddable)
```

When the simulator says "SCALE_UP node-0 by 0.5", the system:
1. Looks up node-0 in the workload map -> `payments` in `prod-sre`
2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)`
3. Kubernetes creates/destroys pods to match the new replica count
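
Step 1 of that lookup can be sketched as follows. The example assumes `ANTIATROPOS_WORKLOAD_MAP` holds a JSON object keyed by simulator node ID, in the shape shown in the SCALE_UP flow; the exact encoding of the env var is an assumption, and `resolve` is a hypothetical helper.

```python
import json

# Assumed shape: a JSON object keyed by simulator node ID.
EXAMPLE_MAP = json.dumps({
    "node-0": {"deployment": "payments", "namespace": "prod-sre"},
    "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
    "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
    "node-3": {"deployment": "cart", "namespace": "prod-sre"},
    "node-4": {"deployment": "auth", "namespace": "prod-sre"},
})


def resolve(node_id: str, raw_map: str) -> tuple[str, str]:
    """Return (deployment, namespace) for a simulator node."""
    mapping = json.loads(raw_map)
    if node_id not in mapping:
        raise KeyError(f"no workload mapped to {node_id}")
    target = mapping[node_id]
    return target["deployment"], target["namespace"]


print(resolve("node-0", EXAMPLE_MAP))  # ('payments', 'prod-sre')
```

In a deployment, `raw_map` would come from `os.environ["ANTIATROPOS_WORKLOAD_MAP"]` instead of the inline example.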

---

## What Runs Where (Complete List)

### On Hugging Face Spaces

| Component | What It Does | Port |
|---|---|---|
| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |

### On AWS EKS

| Component | Namespace | What It Does |
|---|---|---|
| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |

### On AWS Managed Services

| Service | What It Does |
|---|---|
| AMP (Amazon Managed Service for Prometheus) | Stores all metrics. Queried by HF Spaces. |
| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |

---

## The Simulator vs Real Cluster

AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`:

### Simulated Mode (`simulated`)

Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.

This is the default on HF Spaces without AWS configured.

### Hybrid Mode (`hybrid`)

The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent reads real data, but its actions affect only the simulator, not real pods.

### Live Mode (`live`)

The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money.

**Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.**
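
The difference between the three modes comes down to two questions: are the metrics real, and are the actions real? The sketch below encodes that as a small enum. The class and function names are hypothetical, and falling back to `simulated` when the variable is unset is an assumption based on the default described above.

```python
from enum import Enum
from typing import Optional


class EnvMode(Enum):
    SIMULATED = "simulated"
    HYBRID = "hybrid"
    LIVE = "live"


def parse_mode(value: Optional[str]) -> EnvMode:
    # Assumption: an unset or empty variable means simulated mode.
    return EnvMode(value.strip().lower()) if value else EnvMode.SIMULATED


def capabilities(mode: EnvMode) -> dict[str, bool]:
    """Which parts of the loop touch real infrastructure in each mode."""
    return {
        "reads_real_metrics": mode in (EnvMode.HYBRID, EnvMode.LIVE),
        "scales_real_pods": mode is EnvMode.LIVE,
    }


print(capabilities(parse_mode("hybrid")))
# {'reads_real_metrics': True, 'scales_real_pods': False}
```

An unrecognized value raises `ValueError`, so a typo in the env var fails loudly rather than silently selecting a mode.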

---

## Cost Flow

Every pod on EKS costs money. Here is how costs flow based on the agent's actions:

```
Agent action: SCALE_UP node-0
  -> payments Deployment: replicas 2 -> 5
  -> 3 new pods created
  -> If existing nodes are full, Cluster Autoscaler adds a node
  -> New node = another t3.medium EC2 instance = ~$0.04/hr
  -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota

Agent action: SCALE_DOWN node-3
  -> cart Deployment: replicas 4 -> 1
  -> 3 pods terminated
  -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
  -> One fewer EC2 instance = saves ~$0.04/hr
```

The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:

```
R_t = -(alpha * delta_V  +  beta * cost  +  gamma * SLA_violation)
                                  ^^^^
                           beta=0.01 penalizes over-provisioning
```
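
The formula translates directly into Python. Only `beta=0.01` comes from this guide; the default values for `alpha` and `gamma`, and the units of `cost`, are assumptions made so the example runs.

```python
def lyapunov_reward(delta_v: float, cost: float, sla_violation: float,
                    alpha: float = 1.0, beta: float = 0.01, gamma: float = 1.0) -> float:
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)."""
    # alpha and gamma defaults are placeholders, not values from the project.
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)


# With stability and SLA held equal, running more pods lowers the reward.
lean = lyapunov_reward(delta_v=0.0, cost=10.0, sla_violation=0.0)
bloated = lyapunov_reward(delta_v=0.0, cost=30.0, sla_violation=0.0)
print(lean > bloated)  # True
```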

---

## Quick Reference: Key Files

| File | Purpose |
|---|---|
| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
| `prometheus_client.py` | Queries AMP for real metrics |
| `simulator.py` | 5-node fluid-queue simulation |
| `stability.py` | Lyapunov reward computation |
| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |