AntiAtropos Architecture Guide

A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes.

The Big Picture

AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly.

The system is split across two platforms:

Hugging Face Spaces                      AWS
=====================                    ======================
The "brain"                              The "muscle"

AntiAtropos FastAPI server               EKS (Kubernetes cluster)
  - Runs the simulator                     - Runs the actual microservice pods
  - Runs the SRE agent logic               - The agent scales these pods
  - Queries Prometheus for metrics         - Prometheus Agent scrapes metrics
  - Sends scale commands to K8s            - Metrics flow to AMP
                                           - Grafana (AMG) visualizes it all

Why split? HF Spaces is a free or low-cost place to run the Python server, while AWS EKS hosts the real infrastructure the agent practices on.

Kubernetes Concepts You Need

Pod

The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout").

Think of it as: one running instance of a service.

Deployment

A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it.

The key field is spec.replicas — this is the number the SRE agent changes when it scales a service up or down.

Deployment: payments
  replicas: 3         <-- the agent changes this number
    |
    +-- Pod: payments-abc123   (running)
    +-- Pod: payments-def456   (running)
    +-- Pod: payments-ghi789   (running)

The agent scales replicas, not pods. If it raises replicas from 3 to 5, Kubernetes creates 2 more pods. If it then sets replicas: 2, Kubernetes terminates 3 pods.
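
The reconciliation arithmetic can be sketched in a few lines of Python. This is only an illustrative model of what the Deployment controller does, not Kubernetes code; `reconcile` is a hypothetical helper name.

```python
from typing import Tuple


def reconcile(current_pods: int, desired_replicas: int) -> Tuple[int, int]:
    """Return (pods_to_create, pods_to_terminate) needed to reach the desired count."""
    diff = desired_replicas - current_pods
    return (max(diff, 0), max(-diff, 0))


print(reconcile(3, 5))  # (2, 0): two pods are created
print(reconcile(5, 2))  # (0, 3): three pods are terminated
```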

Service

A Service gives pods a stable network name. Instead of connecting to payments-abc123 directly (which changes when the pod is recreated), you connect to payments (the Service), which routes to whichever pods are healthy.

Namespace

A namespace is a folder for organizing resources. We use:

prod-sre — where the 5 microservice Deployments live
monitoring — where the Prometheus Agent pod lives
kube-system — where AWS/EKS system pods live

Node

A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes, and each node runs multiple pods. When all nodes are full and the agent wants to scale up, the Cluster Autoscaler adds more nodes (up to maxSize: 4 in our config).

EKS Cluster
  Node 1 (t3.medium - 2 vCPU, 4GB RAM)
    Pod: payments-abc123
    Pod: checkout-def456
    Pod: catalog-ghi789
    Pod: prometheus-agent-xyz
  Node 2 (t3.medium - 2 vCPU, 4GB RAM)
    Pod: payments-jkl012    <-- agent scaled payments from 1 to 2
    Pod: cart-mno345
    Pod: auth-pqr678

ResourceQuota

A ResourceQuota is a hard limit on how many resources a namespace can use. We set one on prod-sre that caps total pods at 30. This is a safety net: even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods.

How the SRE Agent Works

The Loop

Every "tick" (one step of the simulation), the agent goes through this cycle:

1. OBSERVE  -- Read telemetry (CPU, latency, queue depth) from Prometheus
2. DECIDE   -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP)
3. ACT      -- Send the action to KubernetesExecutor
4. REWARD   -- Compute Lyapunov stability reward (was the cluster more or less stable?)
5. REPEAT
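
As a rough sketch, the cycle above maps onto a simple loop. Everything here is a stand-in: `StubEnvironment`, `Telemetry`, and the threshold policy in `decide` are hypothetical names invented for illustration, and the fixed reward is a placeholder for the Lyapunov computation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Telemetry:
    cpu: float          # utilisation between 0.0 and 1.0
    latency_ms: float
    queue_depth: int


class StubEnvironment:
    """Hypothetical stand-in for the real environment; returns canned values."""

    def observe(self) -> Telemetry:
        return Telemetry(cpu=0.9, latency_ms=250.0, queue_depth=40)

    def act(self, action: str, target: str, parameter: float) -> str:
        return f"Ack: {action} for {target}"

    def reward(self) -> float:
        return -1.0  # placeholder; the real reward is Lyapunov-based


def decide(t: Telemetry) -> Tuple[str, str, float]:
    """Toy threshold policy, assumed purely for illustration."""
    if t.cpu > 0.8:
        return ("SCALE_UP", "node-0", 0.5)
    if t.cpu < 0.2:
        return ("SCALE_DOWN", "node-0", 0.5)
    return ("NO_OP", "node-0", 0.0)


def run(env: StubEnvironment, ticks: int) -> List[Tuple[str, str, float]]:
    log = []
    for _ in range(ticks):
        telemetry = env.observe()                   # 1. OBSERVE
        action, target, param = decide(telemetry)   # 2. DECIDE
        ack = env.act(action, target, param)        # 3. ACT
        reward = env.reward()                       # 4. REWARD
        log.append((action, ack, reward))           # 5. REPEAT on the next tick
    return log


for entry in run(StubEnvironment(), ticks=2):
    print(entry)
```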

How Each Action Works

| Action | What the Agent Decides | What Happens on EKS |
| --- | --- | --- |
| `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` |
| `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` |
| `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments |
| `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` |
| `NO_OP` | "Do nothing this tick" | Nothing changes on EKS |
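
To make the `SHED_LOAD` arithmetic concrete, here is a minimal sketch. The function name, the truncating `int()` rounding, and the floor of 1 replica are assumptions for illustration; the real executor may round or bound differently.

```python
def shed_load_replicas(current: int, parameter: float, min_replicas: int = 1) -> int:
    """Scale the target down by parameter * current replicas, never below the floor."""
    reduction = int(parameter * current)  # assumed rounding: truncate toward zero
    return max(min_replicas, current - reduction)


print(shed_load_replicas(4, 0.5))  # 2: "drop 50%" halves a 4-replica deployment
print(shed_load_replicas(1, 0.5))  # 1: already at the assumed floor
```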

The SCALE_UP Flow in Detail

Here is exactly what happens when the agent decides to scale up node-0 (the payments service):

HF Spaces                                    AWS EKS
----------                                   --------

Agent: "SCALE_UP, node-0, parameter=0.5"
  |
  v
AntiAtroposEnvironment.step()
  |
  v
KubernetesExecutor.execute_with_metadata()
  |
  v
_load_node_workload_map()
  reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"}
  |
  v
_scale_deployment("SCALE_UP", "node-0", 0.5)
  |
  +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre")
  |      Current replicas = 2
  |
  +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1
  |      Desired = min(6, 2 + 1) = 3        <-- max_replicas cap from env var
  |
  +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre",
  |         body={"spec": {"replicas": 3}})
  |
  v                                                     +---------------------------+
Returns: "Ack: SCALE_UP for node-0 -                    | K8s creates 1 new pod:    |
  deployment payments in namespace                       |   payments-newpod-xyz     |
  prod-sre scaled 2->3"                                  +---------------------------+
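
The replica arithmetic in step 2 can be sketched as follows. The walkthrough multiplies the parameter by 3 without naming that constant, so `max_step` is a hypothetical name for it. The floor of 1 is taken from the "bounds 1-6" message, and the symmetric `SCALE_DOWN` branch is an assumption.

```python
def desired_replicas(
    action: str,
    current: int,
    parameter: float,
    min_replicas: int = 1,
    max_replicas: int = 6,   # ANTIATROPOS_MAX_REPLICAS in the walkthrough
    max_step: int = 3,       # hypothetical name for the "3" in step 2
) -> int:
    """Compute the replica count to patch, clamped to [min_replicas, max_replicas]."""
    delta = max(1, int(parameter * max_step))
    if action == "SCALE_UP":
        return min(max_replicas, current + delta)
    if action == "SCALE_DOWN":
        return max(min_replicas, current - delta)
    raise ValueError(f"unsupported action: {action}")


print(desired_replicas("SCALE_UP", current=2, parameter=0.5))  # 3, as in the flow above
print(desired_replicas("SCALE_UP", current=6, parameter=0.5))  # 6, unchanged at the cap
```

The resulting number is what would go into the `{"spec": {"replicas": ...}}` body of the patch call in step 3.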

The Telemetry Flow in Detail

How the agent reads metrics from the real cluster:

EKS Cluster                              AMP                          HF Spaces
-----------                              ---                          ----------

Workload pods                            AMP Workspace                AntiAtropos
(payments, checkout...)                  stores all metrics           PrometheusClient
  |                                           ^                        |
  | /metrics (scraped every 15s)              |                        |
  v                                           |                        |
Prometheus Agent                             |                        |
  |                                           |                        |
  | remote-write (SigV4 auth)                 |                        |
  +------------------------------------------->                        |
                                              |                        |
                                              |  HTTPS query           |
                                              +------------------------>
                                              (PROMETHEUS_URL env var)
                                                                       |
                                                                       v
                                                                 _fetch_real_metrics()
                                                                 runs PromQL like:
                                                                   sum(rate(http_requests_total[1m])) by (pod)
                                                                 returns: TelemetryRecord for each node
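
The query side of this flow can be sketched with the standard Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The helper names and the sample payload below are illustrative assumptions. Request signing and the HTTP call itself are left out, so this only shows how a query URL is built and how a response is turned into per-pod values.

```python
from typing import Dict
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> Dict[str, float]:
    """Map each pod label to its sample value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for sample in payload["data"]["result"]:
        pod = sample["metric"].get("pod", "")
        _timestamp, value = sample["value"]  # value arrives as a string
        values[pod] = float(value)
    return values


# Hand-written example payload in the Prometheus response format (not real data).
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "payments-abc123"}, "value": [1700000000, "12.5"]},
        ],
    },
}

print(build_query_url("https://example.invalid/workspaces/ws-EXAMPLE",
                      "sum(rate(http_requests_total[1m])) by (pod)"))
print(parse_instant_vector(sample_response))
```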

The Three Layers of Scaling Caps

This is the most important thing to understand for cost control. There are three independent limits:

Layer 1: Python Code Cap (Soft)

Where: ANTIATROPOS_MAX_REPLICAS env var on HF Spaces, read by kubernetes_executor.py line 18.

How it works: The _scale_deployment() method calculates desired = min(self.max_replicas, current + delta). If the agent tries to scale above 6, it gets:

Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)

Can it be bypassed? Yes, by a bug in the code or by someone running kubectl scale deployment payments --replicas=50 directly against the cluster.

Set to: 6 on HF Spaces.

Layer 2: Kubernetes ResourceQuota (Hard)

Where: k8s-workloads.yaml — ResourceQuota on the prod-sre namespace.

How it works: The Kubernetes API server rejects any pod creation that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st:

Error from server (Forbidden): pods "payments-new" is forbidden:
exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30

Can it be bypassed? Only by someone with cluster-admin access who deletes or edits the ResourceQuota.

Set to: 30 pods total, 8 CPU, 8GB RAM.

Layer 3: EKS Node Group Max Size (Hard)

Where: eksctl-cluster.yaml — managedNodeGroups[0].maxSize: 4.

How it works: The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait.

Can it be bypassed? Only by someone with AWS permissions to edit the node group (for example, in the AWS console).

Set to: 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max).

How the Three Layers Work Together

Agent wants to scale all 5 deployments to 20 replicas each:

Layer 1 (Python cap):      6 replicas max per deployment  -> agent gets "unchanged at 6"
                           5 x 6 = 30 pods maximum

Layer 2 (ResourceQuota):   30 pods max in namespace       -> 31st pod is Forbidden

Layer 3 (Node group):      4 nodes max                     -> if 30 pods don't fit on 4 nodes,
                                                            some stay Pending (no cost)

Worst case with all caps:  30 pods on 4 nodes = ~$160/month
Without any caps:          100 pods on 25 nodes = ~$1,800/month
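
The way the three caps stack can be checked with a little arithmetic. In this sketch the function names are hypothetical, and `pods_per_node` is an assumed parameter because the document does not state how many pods fit on one node.

```python
from typing import Tuple


def effective_pod_ceiling(deployments: int, max_replicas: int, quota_pods: int) -> int:
    """Layers 1 and 2: the Python cap per deployment, then the namespace quota."""
    return min(deployments * max_replicas, quota_pods)


def place_pods(pods: int, max_nodes: int, pods_per_node: int) -> Tuple[int, int]:
    """Layer 3: split pods into (running, pending) given the node group limit."""
    running = min(pods, max_nodes * pods_per_node)
    return running, pods - running


ceiling = effective_pod_ceiling(deployments=5, max_replicas=6, quota_pods=30)
print(ceiling)                                            # 30
print(place_pods(ceiling, max_nodes=4, pods_per_node=5))  # (20, 10) under the assumed density
```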

The Mapping: Simulator Nodes to Real Deployments

The simulator has 5 abstract nodes (node-0 through node-4). The ANTIATROPOS_WORKLOAD_MAP env var tells the system which K8s Deployment each simulator node maps to:

Simulator Node    K8s Deployment    Namespace    Notes
-------------     ---------------   ---------    -----
node-0            payments          prod-sre     VIP (4x importance weight)
node-1            checkout          prod-sre     Critical (no SHED_LOAD)
node-2            catalog           prod-sre     Critical (no SHED_LOAD)
node-3            cart              prod-sre     Non-critical (sheddable)
node-4            auth              prod-sre     Non-critical (sheddable)

When the simulator says "SCALE_UP node-0 by 0.5", the system:

1. Looks up node-0 in the workload map -> payments in prod-sre
2. Calls patch_namespaced_deployment_scale("payments", "prod-sre", ...)
3. Kubernetes creates/destroys pods to match the new replica count
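
A minimal sketch of the lookup step, assuming ANTIATROPOS_WORKLOAD_MAP holds a JSON object keyed by simulator node. The encoding is an assumption based on the `{"deployment": ..., "namespace": ...}` shape shown earlier, and `load_workload_map` and `resolve` are hypothetical helper names.

```python
import json
import os
from typing import Dict, Optional, Tuple


def load_workload_map(raw: Optional[str] = None) -> Dict[str, Dict[str, str]]:
    """Parse the node-to-deployment map from a string or the environment."""
    if raw is None:
        raw = os.environ.get("ANTIATROPOS_WORKLOAD_MAP", "{}")
    mapping = json.loads(raw)
    for node, target in mapping.items():
        if "deployment" not in target or "namespace" not in target:
            raise ValueError(f"{node}: entry needs 'deployment' and 'namespace'")
    return mapping


def resolve(mapping: Dict[str, Dict[str, str]], node_id: str) -> Tuple[str, str]:
    """Return (deployment, namespace) for a simulator node."""
    target = mapping[node_id]
    return target["deployment"], target["namespace"]


example = '{"node-0": {"deployment": "payments", "namespace": "prod-sre"}}'
print(resolve(load_workload_map(example), "node-0"))  # ('payments', 'prod-sre')
```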

What Runs Where (Complete List)

On Hugging Face Spaces

| Component | What It Does | Port |
| --- | --- | --- |
| FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) |
| Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal |
| PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS |
| KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS |
| Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 |
| Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 |

On AWS EKS

| Component | Namespace | What It Does |
| --- | --- | --- |
| payments Deployment | prod-sre | 2 nginx pods (scales with agent) |
| checkout Deployment | prod-sre | 1 nginx pod (scales with agent) |
| catalog Deployment | prod-sre | 1 nginx pod (scales with agent) |
| cart Deployment | prod-sre | 1 nginx pod (scales with agent) |
| auth Deployment | prod-sre | 1 nginx pod (scales with agent) |
| Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP |
| Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand |

On AWS Managed Services

| Service | What It Does |
| --- | --- |
| AMP (Amazon Managed Service for Prometheus) | Stores all metrics. Queried by HF Spaces. |
| AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. |

The Simulator vs Real Cluster

AntiAtropos has three modes controlled by ANTIATROPOS_ENV_MODE:

Simulated Mode (`simulated`)

Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox.

This is the default on HF Spaces without AWS configured.

Hybrid Mode (`hybrid`)

The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says payments pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods.

Live Mode (`live`)

The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says SCALE_UP, actual pods get created on actual EC2 instances that cost actual money.

Set ANTIATROPOS_ENV_MODE=live on HF Spaces to enable this.
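
The difference between the three modes comes down to two switches: whether real metrics are read and whether real scale commands are sent. A small sketch, in which `EnvMode`, `parse_mode`, and `capabilities` are hypothetical names rather than the project's actual API:

```python
from enum import Enum
from typing import Dict, Optional


class EnvMode(Enum):
    SIMULATED = "simulated"
    HYBRID = "hybrid"
    LIVE = "live"


def parse_mode(value: Optional[str]) -> EnvMode:
    """Interpret an ANTIATROPOS_ENV_MODE value, falling back to simulated when unset."""
    return EnvMode((value or "simulated").strip().lower())


def capabilities(mode: EnvMode) -> Dict[str, bool]:
    """Summarise what each mode touches outside the simulator."""
    return {
        "reads_real_metrics": mode in (EnvMode.HYBRID, EnvMode.LIVE),
        "writes_to_cluster": mode is EnvMode.LIVE,
    }


for name in ("simulated", "hybrid", "live"):
    print(name, capabilities(parse_mode(name)))
```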

Cost Flow

Every pod on EKS costs money. Here is how costs flow based on the agent's actions:

Agent action: SCALE_UP node-0
  -> payments Deployment: replicas 2 -> 5
  -> 3 new pods created
  -> If existing nodes are full, Cluster Autoscaler adds a node
  -> New node = another t3.medium EC2 instance = ~$0.04/hr
  -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota

Agent action: SCALE_DOWN node-3
  -> cart Deployment: replicas 4 -> 1
  -> 3 pods terminated
  -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min)
  -> One fewer EC2 instance = saves ~$0.04/hr
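
The per-pod and per-node numbers above can be turned into a quick back-of-the-envelope calculation. The constants are the approximate figures quoted in this section, and the helper names are hypothetical. CPU is tracked in millicores to avoid floating-point noise.

```python
from typing import Tuple

NODE_HOURLY_USD = 0.04     # approximate t3.medium price quoted above
POD_CPU_MILLICORES = 100   # 0.1 CPU per pod
POD_MEMORY_MB = 64         # 64MB RAM per pod


def quota_usage(pods: int) -> Tuple[int, int]:
    """Return (CPU millicores, memory MB) that a number of pods draws from the quota."""
    return pods * POD_CPU_MILLICORES, pods * POD_MEMORY_MB


def node_cost_delta_per_hour(nodes_changed: int) -> float:
    """Hourly cost change when the autoscaler adds (+) or removes (-) nodes."""
    return nodes_changed * NODE_HOURLY_USD


print(quota_usage(3))                # (300, 192): 0.3 CPU and 192MB for 3 new pods
print(node_cost_delta_per_hour(1))   # 0.04
print(node_cost_delta_per_hour(-1))  # -0.04
```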

The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently:

R_t = -(alpha * delta_V  +  beta * cost  +  gamma * SLA_violation)
                                  ^^^^
                           beta=0.01 penalizes over-provisioning
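
The reward formula translates directly into code. In this sketch only beta=0.01 comes from the text. The values of alpha and gamma are not stated here, so they are required arguments, and the units of `cost` and `sla_violation` are left to the caller.

```python
def reward(
    delta_v: float,
    cost: float,
    sla_violation: float,
    alpha: float,
    gamma: float,
    beta: float = 0.01,
) -> float:
    """R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation)."""
    return -(alpha * delta_v + beta * cost + gamma * sla_violation)


# Same stability change, different cost: the more expensive state scores lower.
lean = reward(delta_v=1.0, cost=10.0, sla_violation=0.0, alpha=1.0, gamma=1.0)
bloated = reward(delta_v=1.0, cost=30.0, sla_violation=0.0, alpha=1.0, gamma=1.0)
print(lean > bloated)  # True
```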

Quick Reference: Key Files

| File | Purpose |
| --- | --- |
| `kubernetes_executor.py` | Translates agent actions to K8s API calls |
| `prometheus_client.py` | Queries AMP for real metrics |
| `simulator.py` | 5-node fluid-queue simulation |
| `stability.py` | Lyapunov reward computation |
| `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS |
| `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) |
| `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent |
| `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces |