AntiAtropos Architecture Guide
A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.
The Big Picture
AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.
The system is split across two platforms:
| Hugging Face Spaces (the "brain") | AWS (the "muscle") |
|---|---|
| AntiAtropos FastAPI server | EKS (Kubernetes cluster) |
| Runs the simulator | Runs the actual microservice pods |
| Runs the SRE agent logic | The agent scales these pods |
| Queries Prometheus for metrics | Prometheus Agent scrapes metrics |
| Sends scale commands to K8s | Metrics flow to AMP |
| | Grafana (AMG) visualizes it all |
Why split? HF Spaces is a free or cheap place to run the Python server. AWS EKS hosts the real infrastructure that the agent practices on.
Kubernetes Concepts You Need
Pod
The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").
Think of it as: one running instance of a service.
Deployment
A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.
The key field is spec.replicas — this is the number the SRE agent changes when it scales a service up or down.
Deployment: payments
replicas: 3 <-- the agent changes this number
|
+-- Pod: payments-abc123 (running)
+-- Pod: payments-def456 (running)
+-- Pod: payments-ghi789 (running)
The agent scales replicas, not pods. When it sets replicas: 5, Kubernetes creates or removes pods until exactly 5 are running. If it later sets replicas: 2, Kubernetes terminates 3 of them.
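The declarative behavior described above can be illustrated with a toy reconciliation function. This is a sketch of the idea only, not how the Kubernetes controller is implemented; the name `reconcile` and its return shape are hypothetical and chosen for illustration.

```python
def reconcile(current_pods: int, desired_replicas: int) -> tuple[int, int]:
    """Return (pods_to_create, pods_to_terminate) needed to reach the desired count."""
    if current_pods < 0 or desired_replicas < 0:
        raise ValueError("pod counts must be non-negative")
    diff = desired_replicas - current_pods
    # A positive diff means pods are missing; a negative diff means there are too many.
    return (max(diff, 0), max(-diff, 0))


print(reconcile(3, 5))  # scaling 3 -> 5 creates pods
print(reconcile(5, 2))  # scaling 5 -> 2 terminates pods
```

The agent only ever supplies the second argument (the desired replica count); working out the difference is left to Kubernetes.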
Service
A Service gives pods a stable network name. Instead of connecting to payments-abc123 directly (which changes when the pod is recreated), you connect to payments (the Service), which routes to whichever pods are healthy.
Namespace
A namespace is a folder for organizing resources. We use:
- prod-sre: where the 5 microservice Deployments live
- monitoring: where the Prometheus Agent pod lives
- kube-system: where AWS/EKS system pods live
Node
A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes, and each node runs multiple pods. When all nodes are full and the agent wants to scale up, the Cluster Autoscaler adds more nodes (up to maxSize: 4 in our config).
EKS Cluster
Node 1 (t3.medium - 2 vCPU, 4GB RAM)
Pod: payments-abc123
Pod: checkout-def456
Pod: catalog-ghi789
Pod: prometheus-agent-xyz
Node 2 (t3.medium - 2 vCPU, 4GB RAM)
Pod: payments-jkl012 <-- agent scaled payments from 1 to 2
Pod: cart-mno345
Pod: auth-pqr678
ResourceQuota
A hard limit on how many resources a namespace can use. We set one on prod-sre that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.
How the SRE Agent Works
The Loop
Every "tick" (one step of the simulation), the agent goes through this cycle:
1. OBSERVE -- Read telemetry (CPU, latency, queue depth) from Prometheus
2. DECIDE -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
3. ACT -- Send the action to KubernetesExecutor
4. REWARD -- Compute Lyapunov stability reward (was the cluster more or less stable?)
5. REPEAT
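The cycle above can be sketched as a small loop. Everything here is a stand-in under stated assumptions: `observe` returns synthetic values instead of querying Prometheus, `decide` is a placeholder threshold policy rather than the trained agent, `act` only adjusts a local counter instead of calling KubernetesExecutor, and `reward` is a simple placeholder, not the Lyapunov reward. All function names are hypothetical.

```python
import random
from dataclasses import dataclass

ACTIONS = ("SCALE_UP", "SCALE_DOWN", "REROUTE_TRAFFIC", "SHED_LOAD", "NO_OP")


@dataclass
class Telemetry:
    cpu: float         # utilization as a fraction in [0, 1]
    latency_ms: float
    queue_depth: int


def observe(rng: random.Random) -> Telemetry:
    """OBSERVE: stand-in for a Prometheus read; returns synthetic values."""
    return Telemetry(rng.random(), rng.uniform(10.0, 500.0), rng.randint(0, 100))


def decide(t: Telemetry) -> str:
    """DECIDE: placeholder threshold policy, not the trained agent."""
    if t.cpu > 0.8:
        return "SCALE_UP"
    if t.cpu < 0.2:
        return "SCALE_DOWN"
    return "NO_OP"


def act(replicas: int, action: str) -> int:
    """ACT: stand-in for KubernetesExecutor; only adjusts a local counter."""
    if action == "SCALE_UP":
        return min(6, replicas + 1)
    if action == "SCALE_DOWN":
        return max(1, replicas - 1)
    return replicas


def reward(t: Telemetry) -> float:
    """REWARD: placeholder signal (shorter queues score higher)."""
    return -float(t.queue_depth)


def run_episode(ticks: int, seed: int = 0) -> list[tuple[str, int, float]]:
    rng = random.Random(seed)
    replicas = 2
    history = []
    for _ in range(ticks):  # REPEAT
        telemetry = observe(rng)
        action = decide(telemetry)
        replicas = act(replicas, action)
        history.append((action, replicas, reward(telemetry)))
    return history


for step in run_episode(5):
    print(step)
```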
How Each Action Works
| Action | What the Agent Decides | What Happens on EKS |
|---|---|---|
| SCALE_UP | "node-0 needs more capacity" | KubernetesExecutor patches payments Deployment: replicas: 2 -> 5 |
| SCALE_DOWN | "node-3 is over-provisioned" | KubernetesExecutor patches cart Deployment: replicas: 4 -> 1 |
| REROUTE_TRAFFIC | "Move traffic away from node-2" | KubernetesExecutor scales DOWN target deployment and redistributes replicas to healthy peer deployments |
| SHED_LOAD | "Drop 50% of traffic to node-3" | KubernetesExecutor scales DOWN target deployment by parameter * current_replicas |
| NO_OP | "Do nothing this tick" | Nothing changes on EKS |
The SCALE_UP Flow in Detail
Here is exactly what happens when the agent decides to scale up node-0 (the payments service):
HF Spaces AWS EKS
---------- --------
Agent: "SCALE_UP, node-0, parameter=0.5"
|
v
AntiAtroposEnvironment.step()
|
v
KubernetesExecutor.execute_with_metadata()
|
v
_load_node_workload_map()
reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
|
v
_scale_deployment("SCALE_UP", "node-0", 0.5)
|
+-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
| Current replicas = 2
|
+-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
| Desired = min(6, 2 + 1) = 3 <-- max_replicas cap from env var
|
+-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
| body={"spec": {"replicas": 3}})
|
v
Returns: "Ack: SCALE_UP for node-0 - deployment payments in namespace prod-sre scaled 2->3"
On AWS EKS: K8s creates 1 new pod (payments-newpod-xyz)
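The replica arithmetic in steps 1-3 can be reproduced with a small pure function. This is a sketch under assumptions rather than the code in kubernetes_executor.py: the name `scale_up`, the `step_base` parameter (the 3 that appears in the delta calculation above), and the default bounds are labels chosen here for illustration. The message formats copy the Ack strings shown in this guide.

```python
def scale_up(node_id: str, deployment: str, namespace: str, current: int,
             parameter: float, step_base: int = 3,
             min_replicas: int = 1, max_replicas: int = 6) -> tuple[int, str]:
    """Return (desired_replicas, ack_message) for a SCALE_UP request."""
    # Always step by at least one replica, even for small parameters.
    delta = max(1, int(parameter * step_base))
    # Never exceed the configured cap (assumed default: 6).
    desired = min(max_replicas, current + delta)
    if desired == current:
        msg = (f"Ack: SCALE_UP for {node_id} - replicas unchanged at {current} "
               f"(bounds {min_replicas}-{max_replicas})")
    else:
        msg = (f"Ack: SCALE_UP for {node_id} - deployment {deployment} in "
               f"namespace {namespace} scaled {current}->{desired}")
    return desired, msg


print(scale_up("node-0", "payments", "prod-sre", current=2, parameter=0.5))
print(scale_up("node-0", "payments", "prod-sre", current=6, parameter=0.5))
```

In the real flow, the computed value would be sent to the cluster with patch_namespaced_deployment_scale; the sketch stops before any API call so it can run without cluster access.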
The Telemetry Flow in Detail
How the agent reads metrics from the real cluster:
EKS Cluster: workload pods (payments, checkout...)
|
| /metrics (scraped every 15s)
v
EKS Cluster: Prometheus Agent
|
| remote-write (SigV4 auth)
v
AMP: AMP Workspace (stores all metrics)
^
| HTTPS query (PROMETHEUS_URL env var)
|
HF Spaces: AntiAtropos PrometheusClient
|
v
_fetch_real_metrics()
runs PromQL like:
sum(rate(http_requests_total[1m])) by (pod)
returns: TelemetryRecord for each node
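The query step can be sketched with the standard library alone. The `/api/v1/query` path and the instant-vector response shape follow the Prometheus HTTP API; the helper names `build_query_url` and `parse_instant_vector`, the example base URL, and the sample response are assumptions for illustration. A real request to AMP would also need SigV4 signing, which this sketch omits.

```python
import json
from urllib.parse import urlencode

PROMQL = "sum(rate(http_requests_total[1m])) by (pod)"

# Hand-written sample in the Prometheus instant-vector format (not real data).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "payments-abc123"}, "value": [1700000000, "12.5"]},
            {"metric": {"pod": "cart-mno345"}, "value": [1700000000, "3.0"]},
        ],
    },
})


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL; base_url would come from PROMETHEUS_URL."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict[str, float]:
    """Map each pod label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rates = {}
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        rates[series["metric"].get("pod", "")] = float(value)
    return rates


print(build_query_url("https://example.invalid/workspaces/ws-123", PROMQL))
print(parse_instant_vector(SAMPLE_RESPONSE))
```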
The Three Layers of Scaling Caps
This is the most important thing to understand for cost control. There are three independent limits:
Layer 1: Python Code Cap (Soft)
Where: ANTIATROPOS_MAX_REPLICAS env var on HF Spaces, read by kubernetes_executor.py line 18.
How it works: The _scale_deployment() method calculates desired = min(self.max_replicas, current + delta). If the agent tries to scale above 6, it gets:
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
Can it be bypassed? Yes, either by a bug in the code or by someone running kubectl scale deployment payments --replicas=50 directly against the cluster.
Set to: 6 on HF Spaces.
Layer 2: Kubernetes ResourceQuota (Hard)
Where: k8s-workloads.yaml — ResourceQuota on the prod-sre namespace.
How it works: Kubernetes itself rejects any pod creation that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:
Error from server (Forbidden): pods "payments-new" is forbidden:
exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30
Can it be bypassed? Only by someone with cluster-admin access who deletes or edits the ResourceQuota.
Set to: 30 pods total, 8 CPU, 8GB RAM.
Layer 3: EKS Node Group Max Size (Hard)
Where: eksctl-cluster.yaml — managedNodeGroups[0].maxSize: 4.
How it works: The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.
Can it be bypassed? Only by someone with AWS permissions editing the node group, for example in the AWS console.
Set to: 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).
How the Three Layers Work Together
Agent wants to scale all 5 deployments to 20 replicas each:
Layer 1 (Python cap): 6 replicas max per deployment -> agent gets "unchanged at 6"
5 x 6 = 30 pods maximum
Layer 2 (ResourceQuota): 30 pods max in namespace -> 31st pod is Forbidden
Layer 3 (Node group): 4 nodes max -> if 30 pods don't fit on 4 nodes,
some stay Pending (no cost)
Worst case with all caps: 30 pods on 4 nodes = ~$160/month
Without any caps: 100 pods on 25 nodes = ~$1,800/month
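The arithmetic behind the three layers can be worked through in a few lines. This is a sketch only: `apply_caps` is a hypothetical helper, and `pods_per_node` is an assumed figure for how many pods fit on one node, which in practice depends on instance type and pod resource requests.

```python
def apply_caps(requested_per_deployment: int, deployments: int = 5,
               max_replicas: int = 6, quota_pods: int = 30,
               max_nodes: int = 4, pods_per_node: int = 8) -> dict[str, int]:
    """Apply the three caps in order and report what survives each layer."""
    requested = deployments * requested_per_deployment
    # Layer 1: the Python cap limits each deployment individually.
    after_code_cap = deployments * min(requested_per_deployment, max_replicas)
    # Layer 2: the ResourceQuota limits total pods in the namespace.
    after_quota = min(after_code_cap, quota_pods)
    # Layer 3: only as many pods run as fit on the maximum number of nodes.
    running = min(after_quota, max_nodes * pods_per_node)
    return {
        "requested": requested,
        "after_code_cap": after_code_cap,
        "after_quota": after_quota,
        "running": running,
        "pending": after_quota - running,
    }


print(apply_caps(20))
print(apply_caps(20, pods_per_node=7))
```

With the assumed default of 8 pods per node, all 30 pods fit. With 7 per node, the remainder stay Pending rather than triggering a fifth node.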
The Mapping: Simulator Nodes to Real Deployments
The simulator has 5 abstract nodes (node-0 through node-4). The ANTIATROPOS_WORKLOAD_MAP env var tells the system which K8s Deployment each simulator node maps to:
Simulator Node K8s Deployment Namespace Notes
------------- --------------- --------- -----
node-0 payments prod-sre VIP (4x importance weight)
node-1 checkout prod-sre Critical (no SHED_LOAD)
node-2 catalog prod-sre Critical (no SHED_LOAD)
node-3 cart prod-sre Non-critical (sheddable)
node-4 auth prod-sre Non-critical (sheddable)
When the simulator says "SCALE_UP node-0 by 0.5", the system:
- Looks up node-0 in the workload map -> payments in prod-sre
- Calls patch_namespaced_deployment_scale("payments", "prod-sre", ...)
- Kubernetes creates/destroys pods to match the new replica count
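The lookup step can be sketched as follows. This assumes ANTIATROPOS_WORKLOAD_MAP holds a JSON object keyed by simulator node, with "deployment" and "namespace" fields as in the node-0 entry shown earlier. The exact encoding of the env var is not confirmed by this guide, and `load_workload_map` and `resolve` are hypothetical helper names.

```python
import json

# Example value mirroring the mapping table above (assumed JSON encoding).
EXAMPLE_MAP = json.dumps({
    "node-0": {"deployment": "payments", "namespace": "prod-sre"},
    "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
    "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
    "node-3": {"deployment": "cart", "namespace": "prod-sre"},
    "node-4": {"deployment": "auth", "namespace": "prod-sre"},
})


def load_workload_map(raw: str) -> dict[str, dict[str, str]]:
    """Parse the map and check that every entry names a deployment and namespace."""
    mapping = json.loads(raw)
    for node_id, target in mapping.items():
        missing = {"deployment", "namespace"} - set(target)
        if missing:
            raise ValueError(f"{node_id} is missing fields: {sorted(missing)}")
    return mapping


def resolve(mapping: dict[str, dict[str, str]], node_id: str) -> tuple[str, str]:
    """Translate a simulator node into (deployment, namespace)."""
    target = mapping[node_id]
    return target["deployment"], target["namespace"]


workload_map = load_workload_map(EXAMPLE_MAP)
print(resolve(workload_map, "node-0"))
```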
What Runs Where (Complete List)
On Hugging Face Spaces
| Component | What It Does | Port |
|---|---|---|
| FastAPI server (server/app.py) | HTTP API for the agent | 7860 (via NGINX) |
| Simulator (simulator.py) | 5-node microservice cluster simulation | Internal |
| PrometheusClient (telemetry/prometheus_client.py) | Queries AMP for real metrics | Outbound HTTPS |
| KubernetesExecutor (control/kubernetes_executor.py) | Sends scale commands to EKS | Outbound HTTPS |
| Prometheus metrics exporter | Serves /metrics for HF's monitoring | 8000 |
| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |
On AWS EKS
| Component | Namespace | What It Does |
|---|---|---|
| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |
On AWS Managed Services
| Service | What It Does |
|---|---|
| AMP (Amazon Managed Prometheus) | Stores all metrics. Queried by HF Spaces. |
| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |
The Simulator vs Real Cluster
AntiAtropos has three modes controlled by ANTIATROPOS_ENV_MODE:
Simulated Mode (simulated)
Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.
This is the default on HF Spaces without AWS configured.
Hybrid Mode (hybrid)
The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says payments pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.
Live Mode (live)
The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says SCALE_UP, actual pods get created on actual EC2 instances that cost actual money.
Set ANTIATROPOS_ENV_MODE=live on HF Spaces to enable this.
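The differences between the three modes can be summarized in code. This is a sketch under assumptions: `EnvMode`, `parse_env_mode`, and the two predicate helpers are hypothetical names, and falling back to simulated when the variable is unset reflects the default described above rather than confirmed implementation behavior.

```python
from enum import Enum
from typing import Optional


class EnvMode(str, Enum):
    SIMULATED = "simulated"
    HYBRID = "hybrid"
    LIVE = "live"


def parse_env_mode(value: Optional[str]) -> EnvMode:
    """Parse ANTIATROPOS_ENV_MODE; an unset or empty value means simulated."""
    if value is None or not value.strip():
        return EnvMode.SIMULATED
    return EnvMode(value.strip().lower())  # raises ValueError for unknown modes


def reads_real_metrics(mode: EnvMode) -> bool:
    """Hybrid and live modes both query AMP."""
    return mode in (EnvMode.HYBRID, EnvMode.LIVE)


def sends_real_commands(mode: EnvMode) -> bool:
    """Only live mode touches real pods (and incurs real cost)."""
    return mode is EnvMode.LIVE


for raw in (None, "hybrid", "live"):
    mode = parse_env_mode(raw)
    print(mode.value, reads_real_metrics(mode), sends_real_commands(mode))
```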
Cost Flow
Every pod on EKS costs money. Here is how costs flow based on the agent's actions:
Agent action: SCALE_UP node-0
-> payments Deployment: replicas 2 -> 5
-> 3 new pods created
-> If existing nodes are full, Cluster Autoscaler adds a node
-> New node = another t3.medium EC2 instance = ~$0.04/hr
-> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota
Agent action: SCALE_DOWN node-3
-> cart Deployment: replicas 4 -> 1
-> 3 pods terminated
-> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
-> One fewer EC2 instance = saves ~$0.04/hr
The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:
R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)
^^^^
beta=0.01 penalizes over-provisioning
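A direct transcription of the formula shows how the terms trade off. Only beta=0.01 comes from this guide; the alpha and gamma defaults below, the function name `lyapunov_reward`, and the input values are assumptions for illustration.

```python
def lyapunov_reward(delta_v: float, cost: float, sla_violation: float,
                    alpha: float = 1.0, beta: float = 0.01,
                    gamma: float = 1.0) -> float:
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)."""
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)


# Same stability change, but the second call carries a much higher cost.
lean = lyapunov_reward(delta_v=-1.0, cost=50.0, sla_violation=0.0)
bloated = lyapunov_reward(delta_v=-1.0, cost=500.0, sla_violation=0.0)
print(lean, bloated)
```

A negative delta_V (the cluster became more stable) raises the reward, while a higher cost lowers it, so over-provisioning is penalized even when it improves stability.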
Quick Reference: Key Files
| File | Purpose |
|---|---|
| kubernetes_executor.py | Translates agent actions to K8s API calls |
| prometheus_client.py | Queries AMP for real metrics |
| simulator.py | 5-node fluid-queue simulation |
| stability.py | Lyapunov reward computation |
| deploy/aws/k8s-workloads.yaml | The 5 Deployments + ResourceQuota on EKS |
| deploy/aws/eksctl-cluster.yaml | EKS cluster definition (nodes, caps) |
| deploy/aws/prometheus-agent-values.yaml | Helm config for Prometheus Agent |
| deploy/aws/generate-kubeconfig.sh | Creates kubeconfig for HF Spaces |