Divyansh Agrawal
---
title: AntiAtropos Environment Server
colorFrom: gray
colorTo: red
sdk: docker
pinned: false
app_port: 7860
base_path: /
tags:
- openenv
---
# AntiAtropos: Autonomous SRE Control Environment (OpenEnv)
> **A production-grade RL/agent environment for the future of autonomous DevOps, where intelligent agents replace fragile runbooks, reduce on-call toil, and keep infrastructure healthy without human intervention.**
AntiAtropos is an open, high-fidelity environment for training and benchmarking AI agents on site reliability engineering (SRE), the discipline that keeps production infrastructure alive at scale. It models a live five-node microservice cluster operating under realistic production pressures: demand surges, cascading node failures, SLA deadlines, and hard safety constraints on critical services.
This is not a toy grid world or an abstract planning problem. Every action type, every penalty function, and every telemetry field in AntiAtropos was designed to mirror the exact decisions an on-call engineer faces when the PagerDuty alert fires at 3 AM.
## The Problem: Infrastructure Operations Don't Scale With Humans
Modern platform teams operate infrastructure that is orders of magnitude more complex than the teams managing it. The result is a well-documented set of pain points:
- **On-call toil.** Engineers are paged for incidents that follow predictable patterns (traffic spikes, memory pressure, node failures) and execute the same runbooks repeatedly. This is high-stress, low-leverage work that burns out senior engineers.
- **Reactive, not proactive.** Static autoscaling policies (HPA, VPA) react to thresholds but cannot reason ahead about demand trajectories, reroute traffic away from degrading nodes, or balance cost and reliability over time.
- **Runbook rot.** Documented procedures go stale. Edge cases accumulate. The institutional knowledge that makes incident response fast lives in engineers' heads, not in systems.
AntiAtropos is a training and evaluation ground for agents that solve this problem: systems that can observe cluster telemetry, reason about multi-step consequences, and issue control actions that keep services healthy, cost-efficient, and resilient.
## What AntiAtropos Trains Agents to Do
An agent operating in AntiAtropos executes the same core loop a platform engineer runs continuously:
**1. Observe the cluster state.**
The observation space mirrors real Prometheus/Grafana metrics: request rates, p99 latency, error rates, queue backlogs, CPU utilization, and per-node health. These are the same signals that drive every serious SRE incident workflow.
**2. Reason about what is wrong and why.**
The environment implements genuine queueing dynamics with boot delays, traffic reroute decay, and Lyapunov-based stability measurement. Agents that only react to threshold breaches perform poorly; agents that build a causal model of the cluster perform well.
**3. Issue control actions with real operational semantics.**
- `SCALE_UP`: expand node capacity (with a realistic `BOOT_DELAY_TICKS = 5` cold-start delay)
- `SCALE_DOWN`: reduce capacity and cost
- `REROUTE_TRAFFIC`: shift request load away from unhealthy nodes
- `SHED_LOAD`: drop a fraction of traffic to protect the cluster (forbidden on critical nodes)
- `NO_OP`: hold position when the system is stable
These are not abstract symbols. They map directly to `kubectl scale`, traffic policy overrides, and rate limiter controls used in production Kubernetes environments.
**4. Balance competing objectives across time.**
Uptime vs. cost vs. stability is the fundamental trade-off every platform team navigates. Brute-force overprovisioning fails the cost grader. Underprovisioning fails SLAs. The agent must plan, not just react.
**5. Respect hard safety constraints.**
Critical nodes cannot have load shed. Scale operations are bounded. Invalid actions are penalized. AntiAtropos enforces the same guardrails that production runbooks encode, rewarding agents that understand operational boundaries.
## Why This Matters for AIOps
The trajectory of platform engineering is clear: the toil layer gets automated, and engineers move up the stack. AntiAtropos provides the training and evaluation infrastructure to accelerate that transition responsibly:
- **Benchmark before you deploy.** An agent evaluated on AntiAtropos has been tested against capacity ramps, node failures, and burst surges with safety constraints, covering the incident categories that account for the majority of real production pages.
- **Dense, informative feedback.** Most production telemetry arrives in sparse, high-dimensional streams. AntiAtropos provides step-level Lyapunov-grounded reward signals that give learning algorithms meaningful gradient information at every tick, not just at episode end.
- **Composable with real infrastructure.** The Kubernetes executor (`control/kubernetes_executor.py`) and Prometheus ingestion (`telemetry/prometheus_client.py`) make it possible to wire a trained policy into a real cluster with minimal adaptation, enabling true hybrid-autonomy workflows where the agent handles routine incidents and escalates novel ones.
- **Deterministic grading.** Unlike production incidents where success is hard to measure objectively, AntiAtropos provides a clean `[0.0, 1.0]` composite score per episode, making benchmark comparisons across models and policies reproducible and auditable.
## LLM-as-SRE: Zero-Shot Incident Response Evaluation
`inference.py` provides a complete evaluation harness for testing frontier LLMs as zero-shot SRE agents. Set your API key, pick a model, and run. The script handles the full episode loop: observation formatting, action parsing, constraint enforcement, and final grading.
```bash
# bash/zsh syntax; in Windows cmd.exe, use `set NAME=value` instead of `export`
export OPENAI_API_KEY=your_key_here
export MODEL_NAME=gpt-4.1
export ANTIATROPOS_TASK=task-3
python inference.py
```
This makes AntiAtropos a drop-in benchmark for comparing how well different LLMs reason about infrastructure operations, a capability that is increasingly relevant as AI models are integrated into on-call tooling, runbook automation, and incident triage systems.
## OpenEnv Specification Compliance
AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
- `Action` model: `SREAction` in `models.py`
- `Observation` model: `ClusterObservation` + `NodeObservation` in `models.py`
- `step(action)` returns observation with reward/done fields
- `reset()` returns initial cluster observation
- `state` is exposed through the OpenEnv `State` object
- `openenv.yaml` is present at repository root
OpenEnv manifest:
- name: `AntiAtropos`
- runtime: `fastapi`
- app: `server.app:app`
- port: `7860`
## Environment Dynamics
### Queueing Model
For each node `i`, AntiAtropos uses a fluid queue update:
`Q_i(t+1) = max(Q_i(t) + lambda_eff_i(t) - mu_i(t), 0)`
where:
- `lambda_eff_i = lambda_incoming_i * (1 - shed_fraction_i)`
- `mu_i = capacity_i * 15` requests/tick (or `0` if node has failed)
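As a minimal sketch, the update above can be written for a single node as follows. The function and argument names are illustrative and are not taken from `simulator.py`; only the formula and the factor of 15 requests/tick come from the text.

```python
# Requests served per tick per unit of capacity, as stated in the formula above.
SERVICE_RATE_PER_CAPACITY = 15


def step_queue(queue, incoming, capacity, shed_fraction=0.0, failed=False):
    """Advance one node's fluid queue by one tick (illustrative helper)."""
    # Shedding removes a fraction of incoming traffic before it is queued.
    lambda_eff = incoming * (1.0 - shed_fraction)
    # A failed node serves nothing.
    mu = 0.0 if failed else capacity * SERVICE_RATE_PER_CAPACITY
    # The backlog can never go negative.
    return max(queue + lambda_eff - mu, 0.0)


# Healthy node: 3 capacity units serve 45 req/tick, so 60 incoming adds 15 to the backlog.
print(step_queue(queue=10.0, incoming=60.0, capacity=3))               # 25.0
# Failed node: nothing is served, so the whole arrival rate accumulates.
print(step_queue(queue=10.0, incoming=60.0, capacity=3, failed=True))  # 70.0
```

The `max(..., 0)` clamp is what makes this a fluid queue: spare service capacity in one tick is not banked for the next.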
### Latency and CPU Utilization
- `cpu_i = lambda_incoming_i / mu_i`
- `latency_i = BASE_LATENCY_MS + LATENCY_STEEPNESS * Q_i`
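The two formulas above translate directly into code. In this sketch the constant values are placeholders chosen for illustration (the real values live in the simulator), and the handling of a failed node with `mu == 0` is an assumption, since the text does not say how that case is treated.

```python
import math

# Placeholder values for illustration only; not the simulator's actual constants.
BASE_LATENCY_MS = 20.0
LATENCY_STEEPNESS = 0.5


def cpu_utilization(incoming, mu):
    """Offered load divided by service rate."""
    # Assumption: report a failed node (mu == 0) as saturated instead of dividing by zero.
    return math.inf if mu == 0 else incoming / mu


def latency_ms(queue):
    """Latency grows linearly with queue depth."""
    return BASE_LATENCY_MS + LATENCY_STEEPNESS * queue


print(latency_ms(40.0))  # 40.0 with the placeholder constants above
```

Note that utilization is computed from incoming traffic rather than served traffic, so values above 1.0 indicate a node whose queue is growing.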
### Lyapunov Stability
The core stability objective is a weighted Lyapunov energy:
`V(s) = sum_i (w_i * Q_i^2)`
VIP/business-critical nodes carry higher weights `w_i`. Drift term:
`DeltaV(t) = V(s_t) - V(s_{t-1})`
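A short sketch of the energy and drift computation, using a hypothetical three-node cluster in which the first node is a VIP with weight 3. The weights and queue values are made up for illustration; the function name is not from `stability.py`.

```python
def lyapunov_energy(queues, weights):
    """Weighted sum of squared queue depths: V(s) = sum_i w_i * Q_i^2."""
    return sum(w * q ** 2 for q, w in zip(queues, weights))


# Hypothetical weights: the first node is VIP and counts three times as much.
weights = [3.0, 1.0, 1.0]

v_prev = lyapunov_energy([2.0, 4.0, 0.0], weights)  # 3*4 + 1*16 + 0 = 28.0
v_curr = lyapunov_energy([4.0, 2.0, 0.0], weights)  # 3*16 + 1*4 + 0 = 52.0
drift = v_curr - v_prev                             # 24.0

print(v_prev, v_curr, drift)
```

Total backlog is 6 in both states, yet the energy rises because the backlog moved onto the heavily weighted node. Positive drift therefore signals that the cluster is moving away from a stable state even when aggregate queue depth is flat.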
### Reward Function
`R_raw_t = -(alpha * DeltaV_t + beta * Cost_t + gamma * SLA_violation_t)`
Default weights: `alpha = 0.002`, `beta = 0.01`, `gamma = 10.0`.
Normalized:
`R_norm_t = sigmoid((R_raw_t - midpoint) / temperature)`
The result is a dense step-level signal, rather than a sparse terminal reward, that strongly penalizes SLA failures, invalid actions, and destabilizing queue growth.
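The reward equations can be sketched as below. The default weights come from the text; `midpoint` and `temperature` are left as arguments because their values are not stated here, and the function names are illustrative rather than those used in `stability.py`.

```python
import math

# Default weights as listed above.
ALPHA, BETA, GAMMA = 0.002, 0.01, 10.0


def raw_reward(delta_v, cost, sla_violation):
    """R_raw = -(alpha * DeltaV + beta * Cost + gamma * SLA_violation)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla_violation)


def sigmoid(x):
    """Numerically stable logistic function (avoids overflow for large |x|)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalized_reward(r_raw, midpoint, temperature):
    """Squash the raw reward into (0, 1); midpoint and temperature are caller-supplied."""
    return sigmoid((r_raw - midpoint) / temperature)


# With these weights, a single SLA violation (gamma = 10) outweighs a drift of
# several thousand energy units (alpha = 0.002).
print(raw_reward(delta_v=0.0, cost=0.0, sla_violation=1))
```

A raw reward equal to `midpoint` maps to exactly 0.5, and `temperature` controls how quickly the normalized value saturates toward 0 or 1 on either side.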
## Action Space
`SREAction` (`models.py`):
- `action_type`: `NO_OP` | `SCALE_UP` | `SCALE_DOWN` | `REROUTE_TRAFFIC` | `SHED_LOAD`
- `target_node_id`: `node-0` to `node-4`
- `parameter`: bounded float with action-dependent semantics
Safety constraints enforced by `control/validation.py`:
- `SHED_LOAD` is **forbidden on critical nodes** (`node-0`, `node-1`, `node-2`)
- Scale operations are bounded by node min/max capacity
- Invalid actions are counted and penalized in the final score
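The constraints listed above could be checked along the following lines. This is a standalone sketch using dataclasses, not the Pydantic `SREAction` model or the actual logic in `control/validation.py`; the capacity bounds `MIN_CAPACITY` and `MAX_CAPACITY` are hypothetical values, while the critical-node set and node IDs come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(str, Enum):
    NO_OP = "NO_OP"
    SCALE_UP = "SCALE_UP"
    SCALE_DOWN = "SCALE_DOWN"
    REROUTE_TRAFFIC = "REROUTE_TRAFFIC"
    SHED_LOAD = "SHED_LOAD"


CRITICAL_NODES = {"node-0", "node-1", "node-2"}
VALID_NODES = {f"node-{i}" for i in range(5)}
MIN_CAPACITY, MAX_CAPACITY = 1, 10  # hypothetical bounds for illustration


@dataclass
class Action:
    action_type: ActionType
    target_node_id: str
    parameter: float = 0.0


def validate(action, current_capacity):
    """Return None if the action is allowed, otherwise a reason string."""
    if action.target_node_id not in VALID_NODES:
        return "unknown node"
    if action.action_type is ActionType.SHED_LOAD and action.target_node_id in CRITICAL_NODES:
        return "SHED_LOAD is forbidden on critical nodes"
    if action.action_type is ActionType.SCALE_UP and current_capacity >= MAX_CAPACITY:
        return "already at max capacity"
    if action.action_type is ActionType.SCALE_DOWN and current_capacity <= MIN_CAPACITY:
        return "already at min capacity"
    return None


print(validate(Action(ActionType.SHED_LOAD, "node-1", 0.3), current_capacity=3))
print(validate(Action(ActionType.SHED_LOAD, "node-3", 0.3), current_capacity=3))
```

Returning a reason string rather than raising keeps the episode running, which matches the description of invalid actions being counted and penalized instead of aborting the run.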
## Observation Space
`ClusterObservation`:
- Task/mode/episode step metadata
- Active node count, normalized latency, error rate, backlog
- Cost per hour, Lyapunov energy
- SLA and invalid-action counters
- Raw and normalized reward fields
- Per-node `NodeObservation` list
`NodeObservation` (per node):
- Queue depth, latency, incoming request rate
- CPU utilization, health status
- VIP flag and importance weight
## Task Suite
### `task-1`: Capacity Ramp (Easy)
Load starts near cluster capacity and ramps over the episode. The agent must proactively scale and contain queue growth without overprovisioning. A clean benchmark for predictive capacity planning, the most common form of infrastructure toil.
### `task-2`: Fault Tolerance (Medium)
A non-VIP node fails at a randomized tick. Traffic continues hitting failed capacity until the agent detects the failure and responds. Tests reactive incident response: detecting failure signals, rerouting affected traffic, and compensating with scaling, all under realistic delay constraints.
### `task-3`: Stability Under Surge (Hard)
Major traffic surges target non-critical nodes, threatening to cascade. The agent must protect the VIP Payment Gateway (`node-0`). `SHED_LOAD` is forbidden on critical nodes (`node-0`, `node-1`, and `node-2`). The agent must coordinate pre-emptive `SCALE_UP` to absorb the surge before it arrives and use persistent `REROUTE_TRAFFIC` to redirect load, all while maintaining cost discipline. The closest analogue to a real high-severity incident: time pressure, safety constraints, and no single correct action.
## Grading (0.0–1.0)
Computed by `grader.py` β€” deterministic and reproducible:
| Component | Formula | Weight |
|---|---|---|
| Uptime | Fraction of steps with latency ≤ 0.20 and error rate ≤ 0.05 | 0.4 |
| Cost | `exp(-3.0 * over_ratio)`; punishes overprovisioning | 0.4 |
| Stability | `1 / (1 + (avg_energy / TARGET_ENERGY)^power)` | 0.2 |
- task-3: cost contribution disabled when uptime `< 0.5`
- Invalid-action penalty: `-0.05` per invalid action
- Final value clipped at `0.0`
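Putting the table and the three rules together, the composite score can be sketched as follows. `target_energy` and `power` are passed in because their values are not given here, and treating "cost contribution disabled" as a cost component of zero is an assumption; `grader.py` remains the authoritative implementation.

```python
import math


def composite_score(uptime, over_ratio, avg_energy, target_energy, power,
                    invalid_actions=0, task="task-1"):
    """Weighted uptime/cost/stability score with invalid-action penalty (sketch)."""
    cost = math.exp(-3.0 * over_ratio)
    stability = 1.0 / (1.0 + (avg_energy / target_energy) ** power)

    # Assumption: "disabled" means the cost term contributes nothing.
    if task == "task-3" and uptime < 0.5:
        cost = 0.0

    score = 0.4 * uptime + 0.4 * cost + 0.2 * stability
    score -= 0.05 * invalid_actions
    # Final value is clipped at 0.0 from below.
    return max(score, 0.0)


# Same episode statistics, scored under task-1 and task-3 rules.
stats = dict(uptime=0.4, over_ratio=0.0, avg_energy=100.0, target_energy=100.0, power=2.0)
print(composite_score(task="task-1", **stats))
print(composite_score(task="task-3", **stats))
```

Under this reading, the task-3 rule prevents an agent from collecting the cost component by running a cheap, underprovisioned cluster that fails its uptime target.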
## Observability Stack
AntiAtropos ships a full production-style observability stack:
- Prometheus scrapes environment metrics at `GET /metrics`
- Grafana `antiatropos-overview` dashboard: reward trajectory, queue heatmaps, latency timeseries, SLA violations, per-node state, action throughput, executor reliability
- NGINX reverse proxy exposes `/`, `/prometheus/`, and `/grafana/` on port `7860`
- `deploy/entrypoint.sh` boots the full stack in a single container
## Kubernetes Integration
For teams evaluating agents against real infrastructure:
- `control/kubernetes_executor.py` translates `SCALE_UP`/`SCALE_DOWN` into `kubectl` operations on mapped deployments
- Configure via `ANTIATROPOS_WORKLOAD_MAP` or `ANTIATROPOS_NODE_DEPLOYMENT_MAP`
- `telemetry/prometheus_client.py` ingests live PromQL metrics and reconciles them into simulator state via weighted blending, enabling a real-environment feedback loop with minimal code change
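One plausible reading of "weighted blending" is a convex combination of the simulated value and the live value. The sketch below assumes that interpretation; the function names, the `live_weight` parameter, and the metric keys are all hypothetical and are not taken from `telemetry/prometheus_client.py`.

```python
def blend(simulated, live, live_weight):
    """Convex combination of a simulated and a live metric value."""
    if not 0.0 <= live_weight <= 1.0:
        raise ValueError("live_weight must be in [0, 1]")
    return live_weight * live + (1.0 - live_weight) * simulated


def reconcile(sim_state, live_metrics, live_weight=0.5):
    """Blend each metric present in the live scrape; keep simulator values otherwise."""
    return {
        key: blend(value, live_metrics[key], live_weight) if key in live_metrics else value
        for key, value in sim_state.items()
    }


sim_state = {"queue_depth": 10.0, "latency_ms": 50.0}
live_metrics = {"queue_depth": 20.0}  # latency missing from this scrape
print(reconcile(sim_state, live_metrics, live_weight=0.25))
# {'queue_depth': 12.5, 'latency_ms': 50.0}
```

A weight of 0 reproduces pure simulation and a weight of 1 trusts live telemetry entirely, so the same policy code can move gradually from simulated to real feedback.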
## Baseline Scores
Reproducible NO-OP baseline over 20 seeded runs (100 steps each):
| Task | Mean Composite | Min | Max |
|---|---:|---:|---:|
| task-1 | 0.6980 | 0.6845 | 0.7171 |
| task-2 | 0.7020 | 0.6400 | 0.7560 |
| task-3 | 0.2063 | 0.1721 | 0.2521 |
Task-3's low baseline score reflects the genuine difficulty of burst surge management under safety constraints, and the substantial headroom available for capable agents.
## Setup and Usage
### Local Python
```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Docker
```bash
docker build -t antiatropos:latest .
docker run --rm -p 7860:7860 antiatropos:latest
```
### OpenEnv Validation
```bash
openenv validate
```
### Hugging Face Space Deployment
```bash
openenv push
```
## Project Structure
| Path | Description |
|---|---|
| `models.py` | Typed OpenEnv action/observation models |
| `simulator.py` | Queueing physics, task dynamics, action semantics |
| `stability.py` | Lyapunov/reward math |
| `grader.py` | Deterministic episode scoring |
| `inference.py` | OpenAI-compatible baseline runner |
| `client.py` | OpenEnv client wrapper |
| `openenv.yaml` | Environment manifest |
| `server/AntiAtropos_environment.py` | Environment runtime (`reset`, `step`, state handling) |
| `server/app.py` | FastAPI/OpenEnv app + `/metrics` |
| `control/` | Action validation and Kubernetes executor |
| `telemetry/` | Prometheus ingestion, metric mapping, exporter |
| `deploy/` | Entrypoint, NGINX, Prometheus, Grafana provisioning |
| `Dockerfile`, `server/Dockerfile` | Container build targets |
## Reproducibility
- Containerized execution (`Dockerfile`, `server/Dockerfile`)
- Pinned dependency lockfile (`uv.lock`)
- Deterministic grading equations (`grader.py`)
- Explicit reward equations in code, with no black-box scoring
- Configurable environment variables for mode, telemetry endpoints, and policy runtime
For fixed-seed studies, use controlled simulator seeding in evaluation harnesses.
## Evaluation Alignment
| Criterion | AntiAtropos |
|---|---|
| Real-world utility | Genuine SRE/platform engineering control task with production-grade operational constraints |
| Task quality | 3 tasks with easy-medium-hard progression mapped to real incident categories |
| Grader quality | Deterministic, interpretable composite score in `[0, 1]` |
| Environment design | Dense Lyapunov-grounded reward, clean reset/step loop, explicit episode boundaries |
| Code quality | Typed Pydantic models, modular components, OpenEnv manifest, containerized runtime |
| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |