Divyansh Agrawal

	---
	title: AntiAtropos Environment Server
	colorFrom: gray
	colorTo: red
	sdk: docker
	pinned: false
	app_port: 7860
	base_path: /
	tags:
	- openenv
	---

	# AntiAtropos: Autonomous SRE Control Environment (OpenEnv)

	> A production-grade RL/agent environment for the future of autonomous DevOps — where intelligent agents replace fragile runbooks, reduce on-call toil, and keep infrastructure healthy without human intervention.

	AntiAtropos is an open, high-fidelity environment for training and benchmarking AI agents on site reliability engineering (SRE) — the discipline that keeps production infrastructure alive at scale. It models a live five-node microservice cluster operating under realistic production pressures: demand surges, cascading node failures, SLA deadlines, and hard safety constraints on critical services.

	This is not a toy grid world or an abstract planning problem. Every action type, every penalty function, and every telemetry field in AntiAtropos was designed to mirror the exact decisions an on-call engineer faces when the PagerDuty alert fires at 3 AM.

	## The Problem: Infrastructure Operations Don't Scale With Humans

	Modern platform teams operate infrastructure whose complexity has far outgrown the teams that manage it. The result is a well-documented set of pain points:

	- On-call toil. Engineers are paged for incidents that follow predictable patterns — traffic spikes, memory pressure, node failures — and execute the same runbooks repeatedly. This is high-stress, low-leverage work that burns out senior engineers.
	- Reactive, not proactive. Static autoscaling policies (HPA, VPA) react to thresholds but cannot reason ahead about demand trajectories, reroute traffic away from degrading nodes, or balance cost and reliability over time.
	- Runbook rot. Documented procedures go stale. Edge cases accumulate. The institutional knowledge that makes incident response fast lives in engineers' heads, not in systems.

	AntiAtropos is a training and evaluation ground for agents that solve this problem — systems that can observe cluster telemetry, reason about multi-step consequences, and issue control actions that keep services healthy, cost-efficient, and resilient.

	## What AntiAtropos Trains Agents to Do

	An agent operating in AntiAtropos executes the same core loop a platform engineer runs continuously:

	1. Observe the cluster state.
	The observation space mirrors real Prometheus/Grafana metrics: request rates, p99 latency, error rates, queue backlogs, CPU utilization, and per-node health — the same signals that drive every serious SRE incident workflow.

	2. Reason about what is wrong and why.
	The environment implements genuine queueing dynamics with boot delays, traffic reroute decay, and Lyapunov-based stability measurement. Agents that only react to threshold breaches perform poorly; agents that build a causal model of the cluster perform well.

	3. Issue control actions with real operational semantics.
	- `SCALE_UP` — expand node capacity (with a realistic `BOOT_DELAY_TICKS = 5` cold-start delay)
	- `SCALE_DOWN` — reduce capacity and cost
	- `REROUTE_TRAFFIC` — shift request load away from unhealthy nodes
	- `SHED_LOAD` — drop a fraction of traffic to protect the cluster (forbidden on critical nodes)
	- `NO_OP` — hold position when the system is stable

	These are not abstract symbols. They map directly to `kubectl scale`, traffic policy overrides, and rate limiter controls used in production Kubernetes environments.

	4. Balance competing objectives across time.
	Uptime vs. cost vs. stability is the fundamental trade-off every platform team navigates. Brute-force overprovisioning fails the cost grader. Underprovisioning fails SLAs. The agent must plan — not just react.

	5. Respect hard safety constraints.
	Critical nodes cannot have load shed. Scale operations are bounded. Invalid actions are penalized. AntiAtropos enforces the same guardrails that production runbooks encode, rewarding agents that understand operational boundaries.

	## Why This Matters for AIOps

	The trajectory of platform engineering is clear: the toil layer gets automated, and engineers move up the stack. AntiAtropos provides the training and evaluation infrastructure to accelerate that transition responsibly:

	- Benchmark before you deploy. An agent evaluated on AntiAtropos has been tested against capacity ramps, node failures, and burst surges under safety constraints, three incident categories that on-call teams are routinely paged for.
	- Dense, informative feedback. Most production telemetry arrives in sparse, high-dimensional streams. AntiAtropos provides step-level Lyapunov-grounded reward signals that give learning algorithms meaningful gradient information at every tick — not just at episode end.
	- Composable with real infrastructure. The Kubernetes executor (`control/kubernetes_executor.py`) and Prometheus ingestion (`telemetry/prometheus_client.py`) make it possible to wire a trained policy into a real cluster with minimal adaptation, enabling true hybrid-autonomy workflows where the agent handles routine incidents and escalates novel ones.
	- Deterministic grading. Unlike production incidents where success is hard to measure objectively, AntiAtropos provides a clean `[0.0, 1.0]` composite score per episode — making benchmark comparisons across models and policies reproducible and auditable.

	## LLM-as-SRE: Zero-Shot Incident Response Evaluation

	`inference.py` provides a complete evaluation harness for testing frontier LLMs as zero-shot SRE agents. Set your API key, pick a model, and run — the script handles the full episode loop: observation formatting, action parsing, constraint enforcement, and final grading.

	```bash
	# In Windows cmd, use `set NAME=value` instead of `export NAME=value`.
	export OPENAI_API_KEY=your_key_here
	export MODEL_NAME=gpt-4.1
	export ANTIATROPOS_TASK=task-3
	python inference.py
	```

	This makes AntiAtropos a drop-in benchmark for comparing how well different LLMs reason about infrastructure operations — a capability that is increasingly relevant as AI models are integrated into on-call tooling, runbook automation, and incident triage systems.
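
As an illustration of the action-parsing step, the sketch below extracts a JSON action from a free-form model reply and falls back to `NO_OP` when the reply cannot be used. The function name `parse_action`, the JSON reply format, and the fallback target node are assumptions made for this example; the parsing in `inference.py` may work differently.

```python
import json
import re

# Action names come from this README; everything else here is hypothetical.
ALLOWED_ACTIONS = {"NO_OP", "SCALE_UP", "SCALE_DOWN", "REROUTE_TRAFFIC", "SHED_LOAD"}
FALLBACK = {"action_type": "NO_OP", "target_node_id": "node-0", "parameter": 0.0}


def parse_action(reply: str) -> dict:
    """Take the outermost {...} span of an LLM reply and coerce it into an action dict."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return dict(FALLBACK)
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)
    action_type = data.get("action_type")
    if not isinstance(action_type, str) or action_type not in ALLOWED_ACTIONS:
        return dict(FALLBACK)
    try:
        parameter = float(data.get("parameter", 0.0))
    except (TypeError, ValueError):
        return dict(FALLBACK)
    return {
        "action_type": action_type,
        "target_node_id": str(data.get("target_node_id", "node-0")),
        "parameter": parameter,
    }


reply = 'I would scale: {"action_type": "SCALE_UP", "target_node_id": "node-3", "parameter": 0.5}'
print(parse_action(reply))
print(parse_action("The cluster looks fine."))  # falls back to NO_OP
```

Falling back to `NO_OP` keeps the episode running when a model returns malformed output, at the cost of hiding parse failures unless they are logged separately.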

	## OpenEnv Specification Compliance

	AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
	- `Action` model: `SREAction` in `models.py`
	- `Observation` model: `ClusterObservation` + `NodeObservation` in `models.py`
	- `step(action)` returns observation with reward/done fields
	- `reset()` returns initial cluster observation
	- `state` is exposed through the OpenEnv `State` object
	- `openenv.yaml` is present at repository root

	OpenEnv manifest:
	- name: `AntiAtropos`
	- runtime: `fastapi`
	- app: `server.app:app`
	- port: `7860`

	## Environment Dynamics

	### Queueing Model

	For each node `i`, AntiAtropos uses a fluid queue update:

	`Q_i(t+1) = max(Q_i(t) + lambda_eff_i(t) - mu_i(t), 0)`

	where:
	- `lambda_eff_i = lambda_incoming_i * (1 - shed_fraction_i)`
	- `mu_i = capacity_i * 15` requests/tick (or `0` if the node has failed)
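
The update above can be sketched for a single node as follows. This is a minimal illustration of the stated equations, not the code in `simulator.py`; the function name `queue_step` and its argument names are chosen for this example.

```python
def queue_step(queue, incoming, shed_fraction, capacity, failed=False, rate_per_unit=15.0):
    """Advance one node's fluid queue by one tick."""
    lam_eff = incoming * (1.0 - shed_fraction)       # traffic left after shedding
    mu = 0.0 if failed else capacity * rate_per_unit  # a failed node serves nothing
    return max(queue + lam_eff - mu, 0.0)             # backlog cannot go negative


# A healthy node with 3 capacity units serves 45 requests/tick,
# so 60 incoming requests/tick adds 15 to the backlog every tick.
q = 0.0
for _ in range(3):
    q = queue_step(q, incoming=60.0, shed_fraction=0.0, capacity=3)
print(q)  # 45.0
```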

	### Latency and CPU Utilization

	- `cpu_i = lambda_incoming_i / mu_i`
	- `latency_i = BASE_LATENCY_MS + LATENCY_STEEPNESS * Q_i`
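
A small sketch of these two formulas follows. The numeric values of `BASE_LATENCY_MS` and `LATENCY_STEEPNESS` are placeholders (the README does not state them), and returning `0.0` utilization for a failed node with `mu_i = 0` is an assumption made here to avoid dividing by zero; the simulator may handle that case differently.

```python
BASE_LATENCY_MS = 20.0    # placeholder value, not taken from the project
LATENCY_STEEPNESS = 0.5   # placeholder value: extra ms of latency per queued request


def cpu_utilization(incoming, mu):
    """Offered load divided by service rate; values above 1.0 mean the node is overloaded."""
    # Assumption: a failed node (mu == 0) reports 0.0 instead of raising ZeroDivisionError.
    return incoming / mu if mu > 0 else 0.0


def latency_ms(queue):
    """Latency grows linearly with queue depth."""
    return BASE_LATENCY_MS + LATENCY_STEEPNESS * queue


print(cpu_utilization(incoming=30.0, mu=45.0))
print(latency_ms(queue=40.0))  # 40.0 with the placeholder constants
```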

	### Lyapunov Stability

	The core stability objective is the weighted Lyapunov energy:

	`V(s) = sum_i (w_i * Q_i^2)`

	VIP/business-critical nodes carry higher weights `w_i`. The drift term is:

	`DeltaV(t) = V(s_t) - V(s_{t-1})`
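
The energy and drift can be computed directly from per-node queue depths. The sketch below uses made-up queues and weights for two nodes; the function names are illustrative and are not taken from `stability.py`.

```python
def lyapunov_energy(queues, weights):
    """V(s) = sum_i w_i * Q_i^2."""
    return sum(w * q * q for q, w in zip(queues, weights))


def lyapunov_drift(prev_queues, queues, weights):
    """DeltaV(t) = V(s_t) - V(s_{t-1}); positive drift means backlog is building."""
    return lyapunov_energy(queues, weights) - lyapunov_energy(prev_queues, weights)


weights = [2.0, 1.0]   # first node weighted as VIP (illustrative values)
prev = [1.0, 2.0]
curr = [3.0, 4.0]
print(lyapunov_energy(curr, weights))       # 2*9 + 1*16 = 34.0
print(lyapunov_drift(prev, curr, weights))  # 34.0 - 6.0 = 28.0
```

Because the energy is quadratic in queue depth, one long queue costs more than the same backlog spread evenly across nodes.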

	### Reward Function

	`R_raw_t = -(alpha * DeltaV_t + beta * Cost_t + gamma * SLA_violation_t)`

	Default weights: `alpha = 0.002`, `beta = 0.01`, `gamma = 10.0`.

	Normalized:

	`R_norm_t = sigmoid((R_raw_t - midpoint) / temperature)`

	This gives a dense step-level signal, rather than a sparse terminal reward, that strongly penalizes SLA failures, invalid actions, and destabilizing queue growth.
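
A minimal sketch of the two reward equations, using the default weights listed above. The `midpoint` and `temperature` defaults are placeholders, since this README does not give their values, and the sigmoid is written in a branch form only to avoid floating-point overflow for very negative raw rewards.

```python
import math

ALPHA, BETA, GAMMA = 0.002, 0.01, 10.0  # default weights from this README


def raw_reward(delta_v, cost, sla_violation):
    """R_raw = -(alpha * DeltaV + beta * Cost + gamma * SLA_violation)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla_violation)


def normalized_reward(raw, midpoint=-5.0, temperature=2.0):
    """Squash the raw reward into (0, 1). midpoint/temperature are assumed values."""
    z = (raw - midpoint) / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so this cannot overflow
    return ez / (1.0 + ez)


r = raw_reward(delta_v=100.0, cost=50.0, sla_violation=1.0)
print(r)                     # about -10.7: the SLA term dominates
print(normalized_reward(r))
```

With these weights a single SLA violation costs as much as a Lyapunov drift of 5000, which is why the SLA term dominates the step reward whenever it is active.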

	## Action Space

	`SREAction` (`models.py`):
	- `action_type`: `NO_OP` | `SCALE_UP` | `SCALE_DOWN` | `REROUTE_TRAFFIC` | `SHED_LOAD`
	- `target_node_id`: `node-0` to `node-4`
	- `parameter`: bounded float with action-dependent semantics

	Safety constraints enforced by `control/validation.py`:
	- `SHED_LOAD` is forbidden on critical nodes (`node-0`, `node-1`, `node-2`)
	- Scale operations are bounded by node min/max capacity
	- Invalid actions are counted and penalized in the final score
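
The constraints above can be sketched as a small validator. This is not the code in `control/validation.py`: it uses a standard-library dataclass rather than the project's Pydantic `SREAction`, and the capacity bounds `MIN_CAPACITY` and `MAX_CAPACITY` are placeholder values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(str, Enum):
    NO_OP = "NO_OP"
    SCALE_UP = "SCALE_UP"
    SCALE_DOWN = "SCALE_DOWN"
    REROUTE_TRAFFIC = "REROUTE_TRAFFIC"
    SHED_LOAD = "SHED_LOAD"


ALL_NODES = frozenset(f"node-{i}" for i in range(5))
CRITICAL_NODES = frozenset({"node-0", "node-1", "node-2"})
MIN_CAPACITY, MAX_CAPACITY = 1, 10  # assumed bounds, not taken from the project


@dataclass(frozen=True)
class Action:
    action_type: ActionType
    target_node_id: str
    parameter: float = 0.0


def validate(action, current_capacity):
    """Return (is_valid, reason) for an action against one node's current capacity."""
    if action.target_node_id not in ALL_NODES:
        return False, "unknown node"
    if action.action_type == ActionType.SHED_LOAD and action.target_node_id in CRITICAL_NODES:
        return False, "SHED_LOAD is forbidden on critical nodes"
    if action.action_type == ActionType.SCALE_UP and current_capacity >= MAX_CAPACITY:
        return False, "node is already at maximum capacity"
    if action.action_type == ActionType.SCALE_DOWN and current_capacity <= MIN_CAPACITY:
        return False, "node is already at minimum capacity"
    return True, "ok"


print(validate(Action(ActionType.SHED_LOAD, "node-1", 0.3), current_capacity=3))
print(validate(Action(ActionType.SHED_LOAD, "node-3", 0.3), current_capacity=3))
```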

	## Observation Space

	`ClusterObservation`:
	- Task/mode/episode step metadata
	- Active node count, normalized latency, error rate, backlog
	- Cost per hour, Lyapunov energy
	- SLA and invalid-action counters
	- Raw and normalized reward fields
	- Per-node `NodeObservation` list

	`NodeObservation` (per node):
	- Queue depth, latency, incoming request rate
	- CPU utilization, health status
	- VIP flag and importance weight

	## Task Suite

	### `task-1` — Capacity Ramp (Easy)
	Load starts near cluster capacity and ramps over the episode. The agent must proactively scale and contain queue growth without overprovisioning. A clean benchmark for predictive capacity planning — the most common form of infrastructure toil.

	### `task-2` — Fault Tolerance (Medium)
	A non-VIP node fails at a randomized tick. Traffic continues hitting failed capacity until the agent detects the failure and responds. Tests reactive incident response: detecting failure signals, rerouting affected traffic, and compensating with scaling — under realistic delay constraints.

	### `task-3` — Stability Under Surge (Hard)
	Major traffic surges target non-critical nodes, threatening to cascade. The agent must protect the VIP Payment Gateway (`node-0`), and `SHED_LOAD` is forbidden on critical nodes (`node-0`, `node-1`, and `node-2`). The agent must coordinate pre-emptive `SCALE_UP` to absorb the surge before it arrives and use persistent `REROUTE_TRAFFIC` to redirect load, all while maintaining cost discipline. This task is the closest analogue to a real high-severity incident: time pressure, safety constraints, and no single correct action.

	## Grading (0.0–1.0)

	Computed by `grader.py` — deterministic and reproducible:

	| Component | Formula | Weight |
	|---|---|---|
	| Uptime | Fraction of steps with latency ≤ 0.20 and error rate ≤ 0.05 | 0.4 |
	| Cost | `exp(-3.0 * over_ratio)`, which punishes overprovisioning | 0.4 |
	| Stability | `1 / (1 + (avg_energy / TARGET_ENERGY)^power)` | 0.2 |

	- task-3: cost contribution disabled when uptime `< 0.5`
	- Invalid-action penalty: `-0.05` per invalid action
	- Final value clipped at `0.0`
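
Putting the table and the three rules together, the composite score can be sketched as below. This is an illustration of the stated formulas, not the code in `grader.py`: the `target_energy` and `power` defaults are placeholders, and `over_ratio` is taken as an input because its exact definition is not given in this README.

```python
import math


def composite_score(uptime, over_ratio, avg_energy, invalid_actions,
                    task="task-1", target_energy=1000.0, power=2.0):
    """Weighted sum of uptime, cost, and stability, minus invalid-action penalties."""
    cost = math.exp(-3.0 * over_ratio)
    stability = 1.0 / (1.0 + (avg_energy / target_energy) ** power)
    if task == "task-3" and uptime < 0.5:
        cost = 0.0  # cost contribution is disabled on task-3 when uptime is poor
    score = 0.4 * uptime + 0.4 * cost + 0.2 * stability
    score -= 0.05 * invalid_actions
    return max(score, 0.0)  # final value is clipped at 0.0


print(composite_score(uptime=1.0, over_ratio=0.0, avg_energy=0.0, invalid_actions=0))
print(composite_score(uptime=0.4, over_ratio=0.0, avg_energy=0.0, invalid_actions=0, task="task-3"))
```

The task-3 rule means an agent cannot collect the cost component simply by running a small, cheap cluster that misses its SLAs.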

	## Observability Stack

	AntiAtropos ships a full production-style observability stack:
	- Prometheus scrapes environment metrics at `GET /metrics`
	- Grafana `antiatropos-overview` dashboard: reward trajectory, queue heatmaps, latency timeseries, SLA violations, per-node state, action throughput, executor reliability
	- NGINX reverse proxy exposes `/`, `/prometheus/`, and `/grafana/` on port `7860`
	- `deploy/entrypoint.sh` boots the full stack in a single container

	## Kubernetes Integration

	For teams evaluating agents against real infrastructure:
	- `control/kubernetes_executor.py` translates `SCALE_UP`/`SCALE_DOWN` into `kubectl` operations on mapped deployments
	- Configure via `ANTIATROPOS_WORKLOAD_MAP` or `ANTIATROPOS_NODE_DEPLOYMENT_MAP`
	- `telemetry/prometheus_client.py` ingests live PromQL metrics and reconciles them into simulator state via weighted blending — enabling a real-environment feedback loop with minimal code change
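
One simple reading of "weighted blending" is a convex combination of the simulated value and the live measurement. The sketch below illustrates that idea only; the function `blend`, the `live_weight` parameter, and the fallback to the simulated value when no live sample is available are assumptions, not the behavior of `telemetry/prometheus_client.py`.

```python
def blend(simulated, live, live_weight=0.7):
    """Mix a simulated metric with a live one; live_weight=1.0 trusts the live value fully."""
    if live is None:
        return simulated  # no live sample available, keep the simulator's value
    w = min(max(live_weight, 0.0), 1.0)  # clamp the weight into [0, 1]
    return w * live + (1.0 - w) * simulated


print(blend(simulated=10.0, live=20.0, live_weight=0.5))  # 15.0
print(blend(simulated=10.0, live=None))                   # 10.0
```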

	## Baseline Scores

	Reproducible `NO_OP` baseline over 20 seeded runs (100 steps each):

	| Task | Mean Composite | Min | Max |
	|---|---:|---:|---:|
	| task-1 | 0.6980 | 0.6845 | 0.7171 |
	| task-2 | 0.7020 | 0.6400 | 0.7560 |
	| task-3 | 0.2063 | 0.1721 | 0.2521 |

	Task-3's low baseline score reflects the genuine difficulty of burst surge management under safety constraints — and the substantial headroom available for capable agents.

	## Setup and Usage

	### Local Python

	```bash
	pip install -e .
	uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	### Docker

	```bash
	docker build -t antiatropos:latest .
	docker run --rm -p 7860:7860 antiatropos:latest
	```

	### OpenEnv Validation

	```bash
	openenv validate
	```

	### Hugging Face Space Deployment

	```bash
	openenv push
	```

	## Project Structure

	| Path | Description |
	|---|---|
	| `models.py` | Typed OpenEnv action/observation models |
	| `simulator.py` | Queueing physics, task dynamics, action semantics |
	| `stability.py` | Lyapunov/reward math |
	| `grader.py` | Deterministic episode scoring |
	| `inference.py` | OpenAI-compatible baseline runner |
	| `client.py` | OpenEnv client wrapper |
	| `openenv.yaml` | Environment manifest |
	| `server/AntiAtropos_environment.py` | Environment runtime (`reset`, `step`, state handling) |
	| `server/app.py` | FastAPI/OpenEnv app + `/metrics` |
	| `control/` | Action validation and Kubernetes executor |
	| `telemetry/` | Prometheus ingestion, metric mapping, exporter |
	| `deploy/` | Entrypoint, NGINX, Prometheus, Grafana provisioning |
	| `Dockerfile`, `server/Dockerfile` | Container build targets |

	## Reproducibility

	- Containerized execution (`Dockerfile`, `server/Dockerfile`)
	- Pinned dependency lockfile (`uv.lock`)
	- Deterministic grading equations (`grader.py`)
	- Explicit reward equations in code — no black-box scoring
	- Configurable environment variables for mode, telemetry endpoints, and policy runtime

	For fixed-seed studies, seed the simulator explicitly in the evaluation harness so that runs are repeatable.

	## Evaluation Alignment

	| Criterion | AntiAtropos |
	|---|---|
	| Real-world utility | Genuine SRE/platform engineering control task with production-grade operational constraints |
	| Task quality | 3 tasks with easy-medium-hard progression mapped to real incident categories |
	| Grader quality | Deterministic, interpretable composite score in `[0, 1]` |
	| Environment design | Dense Lyapunov-grounded reward, clean reset/step loop, explicit episode boundaries |
	| Code quality | Typed Pydantic models, modular components, OpenEnv manifest, containerized runtime |
	| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |