	# AntiAtropos Architecture Guide

	A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.

	---

	## The Big Picture

	AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.

	The system is split across two platforms:

	```
	Hugging Face Spaces                     AWS
	=====================                   ======================
	The "brain"                             The "muscle"

	AntiAtropos FastAPI server              EKS (Kubernetes cluster)
	- Runs the simulator                    - Runs the actual microservice pods
	- Runs the SRE agent logic              - The agent scales these pods
	- Queries Prometheus for metrics        - Prometheus Agent scrapes metrics
	- Sends scale commands to K8s           - Metrics flow to AMP
	                                        - Grafana (AMG) visualizes it all
	```

	Why split? HF Spaces is a free or low-cost place to run the Python server, while AWS EKS hosts the real infrastructure that the agent practices on.

	---

	## Kubernetes Concepts You Need

	### Pod

	The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").

	Think of it as: one running instance of a service.

	### Deployment

	A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.

	The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down.

	```
	Deployment: payments
	  replicas: 3          <-- the agent changes this number
	  |
	  +-- Pod: payments-abc123 (running)
	  +-- Pod: payments-def456 (running)
	  +-- Pod: payments-ghi789 (running)
	```

	The agent scales replicas, not pods. When it sets `replicas: 5`, Kubernetes adds or removes pods until exactly 5 are running. If it then sets `replicas: 2`, Kubernetes terminates 3 of those pods.
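As a minimal sketch of what changing `spec.replicas` looks like from Python, the helper below patches a Deployment's scale subresource through the Kubernetes client method `patch_namespaced_deployment_scale`. The helper name `set_replicas` and the `RecordingClient` stand-in are hypothetical, added only so the example runs without a cluster; in real use you would pass a `kubernetes.client.AppsV1Api()` instance instead.

```python
def set_replicas(apps_v1, deployment: str, namespace: str, replicas: int):
    """Patch only the replica count; Kubernetes reconciles the pods to match."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    body = {"spec": {"replicas": replicas}}
    return apps_v1.patch_namespaced_deployment_scale(deployment, namespace, body)


class RecordingClient:
    """Hypothetical stand-in for AppsV1Api that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def patch_namespaced_deployment_scale(self, name, namespace, body):
        self.calls.append((name, namespace, body))
        return body


fake = RecordingClient()
set_replicas(fake, "payments", "prod-sre", 5)
print(fake.calls)
```

The caller never names individual pods: it states the desired count and leaves pod creation and termination to the Deployment controller.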

	### Service

	A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy.

	### Namespace

	A namespace is a folder for organizing resources. We use:
	- `prod-sre` — where the 5 microservice Deployments live
	- `monitoring` — where the Prometheus Agent pod lives
	- `kube-system` — where AWS/EKS system pods live

	### Node

	A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes, and each node runs multiple pods. When all nodes are full and the agent scales up, the new pods stay Pending until the Cluster Autoscaler adds another node (up to `maxSize: 4` in our config).

	```
	EKS Cluster
	  Node 1 (t3.medium - 2 vCPU, 4GB RAM)
	    Pod: payments-abc123
	    Pod: checkout-def456
	    Pod: catalog-ghi789
	    Pod: prometheus-agent-xyz
	  Node 2 (t3.medium - 2 vCPU, 4GB RAM)
	    Pod: payments-jkl012   <-- agent scaled payments from 1 to 2
	    Pod: cart-mno345
	    Pod: auth-pqr678
	```

	### ResourceQuota

	A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.

	---

	## How the SRE Agent Works

	### The Loop

	Every "tick" (one step of the simulation), the agent goes through this cycle:

	```
	1. OBSERVE -- Read telemetry (CPU, latency, queue depth) from Prometheus
	2. DECIDE -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
	3. ACT -- Send the action to KubernetesExecutor
	4. REWARD -- Compute Lyapunov stability reward (was the cluster more or less stable?)
	5. REPEAT
	```
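The cycle above can be sketched as a plain Python loop. Everything here is illustrative: `run_ticks`, the `Action` dataclass, and the toy cluster are hypothetical names, and the toy reward (drop in total queue depth) only stands in for the real Lyapunov computation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

ACTIONS = {"SCALE_UP", "SCALE_DOWN", "REROUTE_TRAFFIC", "SHED_LOAD", "NO_OP"}
Telemetry = Dict[str, Dict[str, float]]


@dataclass
class Action:
    kind: str
    target: str = ""
    parameter: float = 0.0


def run_ticks(observe: Callable[[], Telemetry],
              decide: Callable[[Telemetry], Action],
              act: Callable[[Action], None],
              reward: Callable[[Telemetry, Telemetry], float],
              ticks: int) -> float:
    total = 0.0
    for _ in range(ticks):                   # 5. REPEAT
        before = observe()                   # 1. OBSERVE
        action = decide(before)              # 2. DECIDE
        if action.kind not in ACTIONS:
            raise ValueError(f"unknown action: {action.kind}")
        act(action)                          # 3. ACT
        total += reward(before, observe())   # 4. REWARD
    return total


# Toy stand-ins so the loop runs without a simulator or a cluster.
cluster = {"node-0": {"queue_depth": 12.0, "replicas": 2.0}}


def observe() -> Telemetry:
    return {node: dict(metrics) for node, metrics in cluster.items()}


def decide(telemetry: Telemetry) -> Action:
    if telemetry["node-0"]["queue_depth"] > 10.0:
        return Action("SCALE_UP", "node-0", 0.5)
    return Action("NO_OP")


def act(action: Action) -> None:
    if action.kind == "SCALE_UP":
        cluster[action.target]["replicas"] += 1
        cluster[action.target]["queue_depth"] /= 2


def reward(before: Telemetry, after: Telemetry) -> float:
    depth = lambda t: sum(m["queue_depth"] for m in t.values())
    return depth(before) - depth(after)


total_reward = run_ticks(observe, decide, act, reward, ticks=3)
```

Passing the four steps in as callables keeps the loop itself independent of whether telemetry and actions are simulated or real.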

	### How Each Action Works

	| Action | What the Agent Decides | What Happens on EKS |
	|---|---|---|
	| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
	| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
	| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments |
	| `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` |
	| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |
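The `SHED_LOAD` row can be read as simple arithmetic on the replica count. The sketch below is one plausible reading, assuming the removed share is truncated with `int()` and that the result never drops below a floor of 1 replica. The function name and both of those choices are assumptions, not the confirmed behavior of `KubernetesExecutor`.

```python
def shed_load_replicas(current: int, parameter: float, min_replicas: int = 1) -> int:
    """Return the replica count after shedding `parameter` of the current replicas."""
    if not 0.0 <= parameter <= 1.0:
        raise ValueError("parameter must be between 0 and 1")
    removed = int(parameter * current)          # assumed rounding: truncate
    return max(min_replicas, current - removed)  # assumed floor of 1 replica


# "Drop 50% of traffic to node-3" with cart currently at 4 replicas.
cart_after_shed = shed_load_replicas(4, 0.5)
```

Under these assumptions, a deployment that is already at 1 replica stays at 1 no matter how much load the agent asks to shed.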

	### The SCALE_UP Flow in Detail

	Here is exactly what happens when the agent decides to scale up `node-0` (the payments service):

	```
	HF Spaces                                         AWS EKS
	---------                                         -------

	Agent: "SCALE_UP, node-0, parameter=0.5"
	    |
	    v
	AntiAtroposEnvironment.step()
	    |
	    v
	KubernetesExecutor.execute_with_metadata()
	    |
	    v
	_load_node_workload_map()
	  reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
	    |
	    v
	_scale_deployment("SCALE_UP", "node-0", 0.5)
	    |
	    +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
	    |      Current replicas = 2
	    |
	    +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
	    |      Desired = min(6, 2 + 1) = 3   <-- max_replicas cap from env var
	    |
	    +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
	    |             body={"spec": {"replicas": 3}})
	    |
	    v                                         +---------------------------+
	Returns: "Ack: SCALE_UP for node-0 -          | K8s creates 1 new pod:    |
	  deployment payments in namespace            | payments-newpod-xyz       |
	  prod-sre scaled 2->3"                       +---------------------------+
	```
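Steps 2 and 3 of the flow above reduce to a clamp on the replica count. The sketch below reproduces the arithmetic from the diagram. The multiplier `step_scale=3` is taken from the `int(0.5 * 3)` term, whose meaning the diagram does not explain, so treat it as an assumption. The function names are hypothetical, and the message strings are copied from the examples in this guide.

```python
def scale_up_target(current: int, parameter: float,
                    max_replicas: int = 6, step_scale: int = 3) -> int:
    """Replica count requested by SCALE_UP, capped at max_replicas."""
    delta = max(1, int(parameter * step_scale))  # always move by at least 1
    return min(max_replicas, current + delta)


def ack_message(node: str, deployment: str, namespace: str,
                current: int, desired: int,
                min_replicas: int = 1, max_replicas: int = 6) -> str:
    """Format the acknowledgement in the style shown in this guide."""
    if desired == current:
        return (f"Ack: SCALE_UP for {node} - replicas unchanged at {current} "
                f"(bounds {min_replicas}-{max_replicas})")
    return (f"Ack: SCALE_UP for {node} - deployment {deployment} "
            f"in namespace {namespace} scaled {current}->{desired}")


desired = scale_up_target(current=2, parameter=0.5)
message = ack_message("node-0", "payments", "prod-sre", 2, desired)
```

Calling `scale_up_target` with `current=6` returns 6 again, which is the "unchanged" case the agent sees once it hits the cap.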

	### The Telemetry Flow in Detail

	How the agent reads metrics from the real cluster:

	```
	EKS Cluster                         AMP                      HF Spaces
	-----------                         ---                      ---------

	Workload pods                       AMP Workspace            AntiAtropos
	(payments, checkout...)             stores all metrics       PrometheusClient
	    |                                   ^                        |
	    | /metrics (scraped every 15s)      |                        |
	    v                                   |                        |
	Prometheus Agent                        |                        |
	    |                                   |                        |
	    | remote-write (SigV4 auth)         |                        |
	    +---------------------------------->|                        |
	                                        |                        |
	                                        |      HTTPS query       |
	                                        +----------------------->|
	                                               (PROMETHEUS_URL env var)
	                                                                 |
	                                                                 v
	                                                     _fetch_real_metrics()
	                                                     runs PromQL like:
	                                                       sum(rate(http_requests_total[1m])) by (pod)
	                                                     returns: TelemetryRecord for each node
	```
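The query side of this flow uses the standard Prometheus HTTP API: an instant query is sent to `/api/v1/query` and comes back as a JSON vector. The sketch below builds such a URL and parses a response into per-pod values. The function names, the placeholder base URL, and the sample payload are hypothetical. SigV4 request signing, which AMP requires, is deliberately left out, and this is not the actual `_fetch_real_metrics()` implementation.

```python
import json
from urllib.parse import urlencode

PROMQL = "sum(rate(http_requests_total[1m])) by (pod)"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL; base_url would come from PROMETHEUS_URL."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: str) -> dict:
    """Map each pod label to its sample value from an instant-query response."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error', 'unknown error')}")
    values = {}
    for sample in doc["data"]["result"]:
        pod = sample["metric"].get("pod", "")
        _timestamp, value = sample["value"]  # value arrives as a string
        values[pod] = float(value)
    return values


# Hypothetical base URL and response body, used only to exercise the parser.
url = build_query_url("https://example.invalid/workspaces/ws-demo/", PROMQL)
sample_payload = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"pod": "payments-abc123"}, "value": [1700000000, "12.5"]},
        {"metric": {"pod": "cart-mno345"}, "value": [1700000000, "3"]},
    ]},
})
rates = parse_instant_vector(sample_payload)
```

A real client would still need to group these per-pod values by Deployment before turning them into one record per simulator node.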

	---

	## The Three Layers of Scaling Caps

	This is the most important thing to understand for cost control. There are three independent limits:

	### Layer 1: Python Code Cap (Soft)

	Where: `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18.

	How it works: The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets:

	```
	Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
	```

	Can it be bypassed? Yes: by a bug in the code, or by someone running `kubectl scale deployment payments --replicas=50` directly against the cluster.

	Set to: `6` on HF Spaces.

	### Layer 2: Kubernetes ResourceQuota (Hard)

	Where: `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace.

	How it works: Kubernetes itself refuses to schedule pods that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:

	```
	Error from server (Forbidden): pods "payments-new" is forbidden:
	exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
	```

	Can it be bypassed? Only by someone with cluster-admin access who deletes or edits the ResourceQuota.

	Set to: 30 pods total, 8 CPU, 8GB RAM.

	### Layer 3: EKS Node Group Max Size (Hard)

	Where: `eksctl-cluster.yaml` — `managedNodeGroups[0].maxSize: 4`.

	How it works: The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.

	Can it be bypassed? Only by someone editing the node group in the AWS console.

	Set to: 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).

	### How the Three Layers Work Together

	```
	Agent wants to scale all 5 deployments to 20 replicas each:

	Layer 1 (Python cap):     6 replicas max per deployment -> agent gets "unchanged at 6"
	                          5 x 6 = 30 pods maximum

	Layer 2 (ResourceQuota):  30 pods max in namespace -> 31st pod is Forbidden

	Layer 3 (Node group):     4 nodes max -> if 30 pods don't fit on 4 nodes,
	                          some stay Pending (no cost)

	Worst case with all caps: 30 pods on 4 nodes = ~$160/month
	Without any caps:         100 pods on 25 nodes = ~$1,800/month
	```
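The first two layers can be checked with a few lines of arithmetic. The sketch below is a simplified model, assuming the quota counts only the five workload Deployments' pods. The function name is hypothetical, and Layer 3 is not modeled because whether pods fit on a node depends on allocatable capacity that this guide does not list.

```python
def apply_caps(requested_per_deployment: int, deployments: int = 5,
               max_replicas: int = 6, quota_pods: int = 30) -> dict:
    """Apply the Python cap, then the namespace pod quota, to a scale request."""
    per_deployment = min(requested_per_deployment, max_replicas)  # Layer 1
    requested_total = per_deployment * deployments
    admitted = min(requested_total, quota_pods)                   # Layer 2
    return {
        "per_deployment": per_deployment,
        "requested_total": requested_total,
        "admitted": admitted,
        "forbidden": requested_total - admitted,
    }


with_python_cap = apply_caps(20)
# Simulate Layer 1 failing by lifting the Python cap to the requested value.
without_python_cap = apply_caps(20, max_replicas=20)
```

With the Python cap in place nothing is rejected by the quota, since 5 x 6 is exactly 30. If the cap fails, the quota still admits only 30 of the 100 requested pods.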

	---

	## The Mapping: Simulator Nodes to Real Deployments

	The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to:

	```
	Simulator Node   K8s Deployment   Namespace   Notes
	--------------   --------------   ---------   -----
	node-0           payments         prod-sre    VIP (4x importance weight)
	node-1           checkout         prod-sre    Critical (no SHED_LOAD)
	node-2           catalog          prod-sre    Critical (no SHED_LOAD)
	node-3           cart             prod-sre    Non-critical (sheddable)
	node-4           auth             prod-sre    Non-critical (sheddable)
	```

	When the simulator says "SCALE_UP node-0 by 0.5", the system:
	1. Looks up node-0 in the workload map -> `payments` in `prod-sre`
	2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)`
	3. Kubernetes creates/destroys pods to match the new replica count
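The lookup in step 1 can be sketched as follows, assuming `ANTIATROPOS_WORKLOAD_MAP` holds a JSON object shaped like the `{"deployment": ..., "namespace": ...}` entries shown earlier in this guide. The JSON encoding and the helper names `load_workload_map` and `resolve` are assumptions, not the project's confirmed format.

```python
import json
import os
from typing import Dict, Optional, Tuple

REQUIRED_KEYS = {"deployment", "namespace"}


def load_workload_map(raw: Optional[str] = None) -> Dict[str, Dict[str, str]]:
    """Parse the node-to-Deployment map from a string or the environment."""
    if raw is None:
        raw = os.environ.get("ANTIATROPOS_WORKLOAD_MAP", "{}")
    mapping = json.loads(raw)
    for node, target in mapping.items():
        missing = REQUIRED_KEYS - set(target)
        if missing:
            raise ValueError(f"{node} is missing keys: {sorted(missing)}")
    return mapping


def resolve(mapping: Dict[str, Dict[str, str]], node: str) -> Tuple[str, str]:
    """Return (deployment, namespace) for a simulator node."""
    target = mapping[node]
    return target["deployment"], target["namespace"]


# Example value mirroring the table above.
example_raw = json.dumps({
    "node-0": {"deployment": "payments", "namespace": "prod-sre"},
    "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
    "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
    "node-3": {"deployment": "cart", "namespace": "prod-sre"},
    "node-4": {"deployment": "auth", "namespace": "prod-sre"},
})
workload_map = load_workload_map(example_raw)
payments_target = resolve(workload_map, "node-0")
```

Validating the map once at load time means a malformed entry fails early rather than in the middle of a scaling action.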

	---

	## What Runs Where (Complete List)

	### On Hugging Face Spaces

	| Component | What It Does | Port |
	|---|---|---|
	| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
	| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
	| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
	| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
	| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
	| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |

	### On AWS EKS

	| Component | Namespace | What It Does |
	|---|---|---|
	| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
	| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
	| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
	| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
	| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
	| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
	| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |

	### On AWS Managed Services

	| Service | What It Does |
	|---|---|
	| AMP (Amazon Managed Prometheus) | Stores all metrics. Queried by HF Spaces. |
	| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |

	---

	## The Simulator vs Real Cluster

	AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`:

	### Simulated Mode (`simulated`)

	Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.

	This is the default on HF Spaces without AWS configured.

	### Hybrid Mode (`hybrid`)

	The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.

	### Live Mode (`live`)

	The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money.

	Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.
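One way to picture the three modes is as a small enum plus two predicates saying which real systems each mode touches. This is a sketch based only on the descriptions above. `EnvMode`, `resolve_mode`, and the predicate names are hypothetical, and the fallback to `simulated` assumes that mode is what an unset variable means.

```python
import os
from enum import Enum
from typing import Optional


class EnvMode(str, Enum):
    SIMULATED = "simulated"
    HYBRID = "hybrid"
    LIVE = "live"


def resolve_mode(value: Optional[str] = None) -> EnvMode:
    """Read the mode from an argument or ANTIATROPOS_ENV_MODE."""
    raw = value if value is not None else os.environ.get("ANTIATROPOS_ENV_MODE", "simulated")
    return EnvMode(raw.strip().lower())  # raises ValueError on unknown modes


def reads_real_metrics(mode: EnvMode) -> bool:
    """Hybrid and live both query AMP."""
    return mode in (EnvMode.HYBRID, EnvMode.LIVE)


def sends_real_commands(mode: EnvMode) -> bool:
    """Only live mode patches Deployments on EKS."""
    return mode is EnvMode.LIVE


hybrid = resolve_mode("hybrid")
```

Keeping "reads real metrics" and "sends real commands" as separate checks makes it harder to enable real scaling by accident while testing with real telemetry.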

	---

	## Cost Flow

	Every pod on EKS costs money. Here is how costs flow based on the agent's actions:

	```
	Agent action: SCALE_UP node-0
	  -> payments Deployment: replicas 2 -> 5
	  -> 3 new pods created
	  -> If existing nodes are full, Cluster Autoscaler adds a node
	  -> New node = another t3.medium EC2 instance = ~$0.04/hr
	  -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota

	Agent action: SCALE_DOWN node-3
	  -> cart Deployment: replicas 4 -> 1
	  -> 3 pods terminated
	  -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
	  -> One fewer EC2 instance = saves ~$0.04/hr
	```
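The numbers in the flow above can be turned into a rough calculator. The sketch below uses only figures from this guide (about $0.04 per node-hour, 0.1 CPU and 64MB per pod). The function names are hypothetical, and the cost covers EC2 node-hours only, so it ignores every other line item on the bill.

```python
NODE_HOURLY_USD = 0.04     # approximate t3.medium rate quoted in this guide
POD_CPU_MILLICORES = 100   # 0.1 CPU per pod
POD_MEMORY_MB = 64


def node_cost_usd(nodes: int, hours: float, hourly: float = NODE_HOURLY_USD) -> float:
    """EC2 node-hours only, rounded to cents."""
    return round(nodes * hourly * hours, 2)


def quota_usage(pods: int) -> tuple:
    """Return (CPU millicores, memory MB) that `pods` consume from the quota."""
    return pods * POD_CPU_MILLICORES, pods * POD_MEMORY_MB


# SCALE_UP node-0 from 2 to 5 replicas: 3 new pods, possibly 1 extra node.
extra_node_per_day = node_cost_usd(nodes=1, hours=24)
cpu_millicores, memory_mb = quota_usage(3)
```

Working in millicores keeps the CPU arithmetic in integers, which avoids floating-point noise when summing many small pod requests.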

	The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:

	```
	R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)
	                          ^^^^
	                          beta=0.01 penalizes over-provisioning
	```
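The reward formula translates directly into code. In the sketch below only `beta=0.01` comes from this guide. `alpha` and `gamma` must be passed explicitly because their values are not stated here, and the function name and the example inputs are hypothetical rather than taken from `stability.py`.

```python
def lyapunov_reward(delta_v: float, cost: float, sla_violation: float,
                    *, alpha: float, gamma: float, beta: float = 0.01) -> float:
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)."""
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)


# Illustrative inputs: the cluster became more stable (delta_V < 0) at a cost of 30 units.
stable_but_costly = lyapunov_reward(-2.0, 30.0, 0.0, alpha=1.0, gamma=1.0)
# Same stability gain at a lower cost earns a higher reward.
stable_and_cheap = lyapunov_reward(-2.0, 10.0, 0.0, alpha=1.0, gamma=1.0)
```

Because every term sits inside the negation, a drop in the Lyapunov function raises the reward, while extra cost or an SLA violation lowers it.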

	---

	## Quick Reference: Key Files

	| File | Purpose |
	|---|---|
	| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
	| `prometheus_client.py` | Queries AMP for real metrics |
	| `simulator.py` | 5-node fluid-queue simulation |
	| `stability.py` | Lyapunov reward computation |
	| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
	| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
	| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
	| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |