---
title: AntiAtropos Environment Server
colorFrom: gray
colorTo: red
sdk: docker
pinned: false
app_port: 7860
base_path: /
tags:
- openenv
---

# AntiAtropos: Autonomous SRE Control Environment (OpenEnv)

> **A production-grade RL/agent environment for the future of autonomous DevOps, where intelligent agents replace fragile runbooks, reduce on-call toil, and keep infrastructure healthy without human intervention.**

AntiAtropos is an open, high-fidelity environment for training and benchmarking AI agents on site reliability engineering (SRE), the discipline that keeps production infrastructure alive at scale. It models a live five-node microservice cluster operating under realistic production pressures: demand surges, cascading node failures, SLA deadlines, and hard safety constraints on critical services.

This is not a toy grid world or an abstract planning problem. Every action type, every penalty function, and every telemetry field in AntiAtropos was designed to mirror the decisions an on-call engineer faces when the PagerDuty alert fires at 3 AM.

## The Problem: Infrastructure Operations Don't Scale With Humans

Modern platform teams operate infrastructure that is orders of magnitude more complex than the teams managing it. The result is a well-documented set of pain points:

- **On-call toil.** Engineers are paged for incidents that follow predictable patterns (traffic spikes, memory pressure, node failures) and execute the same runbooks repeatedly. This is high-stress, low-leverage work that burns out senior engineers.
- **Reactive, not proactive.** Static autoscaling policies (HPA, VPA) react to thresholds but cannot reason ahead about demand trajectories, reroute traffic away from degrading nodes, or balance cost and reliability over time.
- **Runbook rot.** Documented procedures go stale. Edge cases accumulate. The institutional knowledge that makes incident response fast lives in engineers' heads, not in systems.

AntiAtropos is a training and evaluation ground for agents that solve this problem: systems that can observe cluster telemetry, reason about multi-step consequences, and issue control actions that keep services healthy, cost-efficient, and resilient.

## What AntiAtropos Trains Agents to Do

An agent operating in AntiAtropos executes the same core loop a platform engineer runs continuously:

**1. Observe the cluster state.**
The observation space mirrors real Prometheus/Grafana metrics: request rates, p99 latency, error rates, queue backlogs, CPU utilization, and per-node health. These are the same signals that drive every serious SRE incident workflow.

**2. Reason about what is wrong and why.**
The environment implements genuine queueing dynamics with boot delays, traffic reroute decay, and Lyapunov-based stability measurement. Agents that only react to threshold breaches perform poorly; agents that build a causal model of the cluster perform well.

**3. Issue control actions with real operational semantics.**
- `SCALE_UP`: expand node capacity (with a realistic `BOOT_DELAY_TICKS = 5` cold-start delay)
- `SCALE_DOWN`: reduce capacity and cost
- `REROUTE_TRAFFIC`: shift request load away from unhealthy nodes
- `SHED_LOAD`: drop a fraction of traffic to protect the cluster (forbidden on critical nodes)
- `NO_OP`: hold position when the system is stable

These are not abstract symbols. They map directly to `kubectl scale`, traffic policy overrides, and rate limiter controls used in production Kubernetes environments.

**4. Balance competing objectives across time.**
Uptime vs. cost vs. stability is the fundamental trade-off every platform team navigates. Brute-force overprovisioning fails the cost grader. Underprovisioning fails SLAs. The agent must plan, not just react.

**5. Respect hard safety constraints.**
Critical nodes cannot have load shed. Scale operations are bounded. Invalid actions are penalized. AntiAtropos enforces the same guardrails that production runbooks encode, rewarding agents that understand operational boundaries.

## Why This Matters for AIOps

The trajectory of platform engineering is clear: the toil layer gets automated, and engineers move up the stack. AntiAtropos provides the training and evaluation infrastructure to accelerate that transition responsibly:

- **Benchmark before you deploy.** An agent evaluated on AntiAtropos has been tested against capacity ramps, node failures, and burst surges with safety constraints, covering the incident categories that account for the majority of real production pages.
- **Dense, informative feedback.** Most production telemetry arrives in sparse, high-dimensional streams. AntiAtropos provides step-level Lyapunov-grounded reward signals that give learning algorithms meaningful gradient information at every tick, not just at episode end.
- **Composable with real infrastructure.** The Kubernetes executor (`control/kubernetes_executor.py`) and Prometheus ingestion (`telemetry/prometheus_client.py`) make it possible to wire a trained policy into a real cluster with minimal adaptation, enabling true hybrid-autonomy workflows where the agent handles routine incidents and escalates novel ones.
- **Deterministic grading.** Unlike production incidents where success is hard to measure objectively, AntiAtropos provides a clean `[0.0, 1.0]` composite score per episode, making benchmark comparisons across models and policies reproducible and auditable.

## LLM-as-SRE: Zero-Shot Incident Response Evaluation

`inference.py` provides a complete evaluation harness for testing frontier LLMs as zero-shot SRE agents. Set your API key, pick a model, and run; the script handles the full episode loop: observation formatting, action parsing, constraint enforcement, and final grading.

```bash
# Use `export` in bash/zsh; on Windows cmd.exe, use `set NAME=value` instead.
export OPENAI_API_KEY=your_key_here
export MODEL_NAME=gpt-4.1
export ANTIATROPOS_TASK=task-3
python inference.py
```

This makes AntiAtropos a drop-in benchmark for comparing how well different LLMs reason about infrastructure operations, a capability that is increasingly relevant as AI models are integrated into on-call tooling, runbook automation, and incident triage systems.

## OpenEnv Specification Compliance

AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
- `Action` model: `SREAction` in `models.py`
- `Observation` model: `ClusterObservation` + `NodeObservation` in `models.py`
- `step(action)` returns observation with reward/done fields
- `reset()` returns initial cluster observation
- `state` is exposed through the OpenEnv `State` object
- `openenv.yaml` is present at repository root

OpenEnv manifest:
- name: `AntiAtropos`
- runtime: `fastapi`
- app: `server.app:app`
- port: `7860`

## Environment Dynamics

### Queueing Model

For each node `i`, AntiAtropos uses a fluid queue update:

`Q_i(t+1) = max(Q_i(t) + lambda_eff_i(t) - mu_i(t), 0)`

where:
- `lambda_eff_i = lambda_incoming_i * (1 - shed_fraction_i)`
- `mu_i = capacity_i * 15` requests/tick (or `0` if node has failed)
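
The update above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `simulator.py`: the function name `step_queue`, its argument names, and the constant name `SERVICE_RATE_PER_CAPACITY` are assumptions introduced here; only the formula and the factor of 15 come from the text.

```python
# Requests served per tick per unit of capacity (the factor 15 in the formula above).
SERVICE_RATE_PER_CAPACITY = 15


def step_queue(queue, incoming, capacity, shed_fraction=0.0, failed=False):
    """Advance one node's fluid queue by one tick (hypothetical helper)."""
    # Arrivals that remain after load shedding.
    lambda_eff = incoming * (1.0 - shed_fraction)
    # A failed node serves nothing.
    mu = 0.0 if failed else capacity * SERVICE_RATE_PER_CAPACITY
    # The queue cannot go negative.
    return max(queue + lambda_eff - mu, 0.0)


# Healthy node: 40 queued + 50 arriving - 2 * 15 served leaves 60 in the queue.
print(step_queue(queue=40.0, incoming=50.0, capacity=2))  # 60.0
# Failed node: nothing is served, so the backlog grows by the full arrival rate.
print(step_queue(queue=40.0, incoming=50.0, capacity=2, failed=True))  # 90.0
```

The failed-node case shows why a reactive agent falls behind: the backlog keeps growing every tick until traffic is rerouted or capacity is restored.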

### Latency and CPU Utilization

- `cpu_i = lambda_incoming_i / mu_i`
- `latency_i = BASE_LATENCY_MS + LATENCY_STEEPNESS * Q_i`
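
As a rough sketch, the two formulas can be evaluated per node as follows. The numeric values of `BASE_LATENCY_MS` and `LATENCY_STEEPNESS` below are placeholders chosen for the example, not the repository's settings, and the handling of a failed node (where `mu_i` is `0`) is an assumption, since the text does not say how that case is reported.

```python
BASE_LATENCY_MS = 20.0    # placeholder value, not taken from the repository
LATENCY_STEEPNESS = 0.5   # placeholder: extra milliseconds per queued request


def node_metrics(queue, incoming, capacity, failed=False):
    """Return (cpu_utilization, latency_ms) for one node (hypothetical helper)."""
    mu = 0.0 if failed else capacity * 15
    # Assumption: a node with no service capacity is reported as fully saturated.
    cpu = incoming / mu if mu > 0 else float("inf")
    # Latency grows linearly with queue depth.
    latency_ms = BASE_LATENCY_MS + LATENCY_STEEPNESS * queue
    return cpu, latency_ms


cpu, latency_ms = node_metrics(queue=60.0, incoming=45.0, capacity=2)
print(cpu, latency_ms)  # 1.5 50.0
```

A CPU value above `1.0` means arrivals exceed the service rate, so the queue (and therefore latency) will keep growing on the next tick.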

### Lyapunov Stability

The core stability objective is a weighted Lyapunov energy:

`V(s) = sum_i (w_i * Q_i^2)`

VIP/business-critical nodes carry higher weights `w_i`. The drift term is:

`DeltaV(t) = V(s_t) - V(s_{t-1})`
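
A small worked example may make the weighting concrete. The sketch below computes the energy and drift exactly as defined above; the weight vector (node-0 weighted `3.0`, the rest `1.0`) is a hypothetical choice for illustration, and the function names are not taken from `stability.py`.

```python
def lyapunov_energy(queues, weights):
    """V(s) = sum_i w_i * Q_i^2."""
    return sum(w * q * q for q, w in zip(queues, weights))


def lyapunov_drift(prev_queues, queues, weights):
    """DeltaV(t) = V(s_t) - V(s_{t-1})."""
    return lyapunov_energy(queues, weights) - lyapunov_energy(prev_queues, weights)


# Hypothetical weights: node-0 is treated as the VIP node.
weights = [3.0, 1.0, 1.0, 1.0, 1.0]

prev_queues = [0.0, 10.0, 0.0, 0.0, 0.0]  # backlog of 10 on a regular node
queues = [10.0, 0.0, 0.0, 0.0, 0.0]       # the same backlog, now on the VIP node

print(lyapunov_energy(prev_queues, weights))         # 100.0
print(lyapunov_energy(queues, weights))              # 300.0
print(lyapunov_drift(prev_queues, queues, weights))  # 200.0
```

The total backlog is unchanged between the two states, yet the drift is positive: moving load onto a higher-weight node counts as destabilizing, and the quadratic term penalizes one deep queue more than several shallow ones.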

### Reward Function

`R_raw_t = -(alpha * DeltaV_t + beta * Cost_t + gamma * SLA_violation_t)`

Default weights: `alpha = 0.002`, `beta = 0.01`, `gamma = 10.0`.

Normalized:

`R_norm_t = sigmoid((R_raw_t - midpoint) / temperature)`

This is a dense, step-level signal rather than a sparse terminal reward, and it strongly penalizes SLA failures, invalid actions, and destabilizing queue growth.
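
The two reward equations can be sketched as below. The default weights are the ones listed above; the `MIDPOINT` and `TEMPERATURE` values are placeholders, since the text does not give them, and treating `SLA_violation_t` as a 0/1 indicator is an assumption made for the example.

```python
import math

ALPHA, BETA, GAMMA = 0.002, 0.01, 10.0  # default weights listed above
MIDPOINT, TEMPERATURE = -5.0, 2.0       # placeholder values, not from the repository


def raw_reward(delta_v, cost, sla_violation):
    """R_raw_t = -(alpha * DeltaV_t + beta * Cost_t + gamma * SLA_violation_t)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla_violation)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normalized_reward(r_raw):
    """R_norm_t = sigmoid((R_raw_t - midpoint) / temperature)."""
    return sigmoid((r_raw - MIDPOINT) / TEMPERATURE)


# Same drift and cost, without and with an SLA violation (assumed 0/1 indicator).
healthy = raw_reward(delta_v=200.0, cost=50.0, sla_violation=0)
breached = raw_reward(delta_v=200.0, cost=50.0, sla_violation=1)
print(round(healthy, 3), round(breached, 3))  # -0.9 -10.9
print(normalized_reward(healthy) > normalized_reward(breached))  # True
```

With `gamma = 10.0`, a single SLA violation outweighs the drift and cost terms in this example by an order of magnitude, which is what makes the signal strongly SLA-averse.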

## Action Space

`SREAction` (`models.py`):
- `action_type`: `NO_OP` | `SCALE_UP` | `SCALE_DOWN` | `REROUTE_TRAFFIC` | `SHED_LOAD`
- `target_node_id`: `node-0` to `node-4`
- `parameter`: bounded float with action-dependent semantics

Safety constraints enforced by `control/validation.py`:
- `SHED_LOAD` is **forbidden on critical nodes** (`node-0`, `node-1`, `node-2`)
- Scale operations are bounded by node min/max capacity
- Invalid actions are counted and penalized in the final score
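
The constraints above can be illustrated with a minimal validator. This is a sketch under stated assumptions, not the contents of `control/validation.py`: it uses a standard-library dataclass as a stand-in for the Pydantic `SREAction` model, and the `validate` function, the `MIN_CAPACITY`/`MAX_CAPACITY` bounds, and the rejection messages are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(str, Enum):
    NO_OP = "NO_OP"
    SCALE_UP = "SCALE_UP"
    SCALE_DOWN = "SCALE_DOWN"
    REROUTE_TRAFFIC = "REROUTE_TRAFFIC"
    SHED_LOAD = "SHED_LOAD"


CRITICAL_NODES = {"node-0", "node-1", "node-2"}
VALID_NODES = {f"node-{i}" for i in range(5)}
MIN_CAPACITY, MAX_CAPACITY = 1, 10  # placeholder bounds, not from the repository


@dataclass(frozen=True)
class SREAction:
    """Stand-in for the Pydantic model described above."""
    action_type: ActionType
    target_node_id: str
    parameter: float = 0.0


def validate(action, current_capacity):
    """Return (is_valid, reason) for an action against one node's capacity."""
    if action.action_type is ActionType.NO_OP:
        return True, "ok"
    if action.target_node_id not in VALID_NODES:
        return False, "unknown node"
    if action.action_type is ActionType.SHED_LOAD and action.target_node_id in CRITICAL_NODES:
        return False, "SHED_LOAD is forbidden on critical nodes"
    if action.action_type is ActionType.SCALE_UP and current_capacity >= MAX_CAPACITY:
        return False, "node is already at maximum capacity"
    if action.action_type is ActionType.SCALE_DOWN and current_capacity <= MIN_CAPACITY:
        return False, "node is already at minimum capacity"
    return True, "ok"


print(validate(SREAction(ActionType.SHED_LOAD, "node-0", 0.3), current_capacity=3))
print(validate(SREAction(ActionType.SHED_LOAD, "node-3", 0.3), current_capacity=3))
```

The first call is rejected because `node-0` is critical; the second is accepted because `node-3` is not.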

## Observation Space

`ClusterObservation`:
- Task/mode/episode step metadata
- Active node count, normalized latency, error rate, backlog
- Cost per hour, Lyapunov energy
- SLA and invalid-action counters
- Raw and normalized reward fields
- Per-node `NodeObservation` list

`NodeObservation` (per node):
- Queue depth, latency, incoming request rate
- CPU utilization, health status
- VIP flag and importance weight

## Task Suite

### `task-1`: Capacity Ramp (Easy)
Load starts near cluster capacity and ramps over the episode. The agent must proactively scale and contain queue growth without overprovisioning. A clean benchmark for predictive capacity planning, the most common form of infrastructure toil.

### `task-2`: Fault Tolerance (Medium)
A non-VIP node fails at a randomized tick. Traffic continues hitting failed capacity until the agent detects the failure and responds. Tests reactive incident response: detecting failure signals, rerouting affected traffic, and compensating with scaling, all under realistic delay constraints.

### `task-3`: Stability Under Surge (Hard)
Major traffic surges target non-critical nodes, threatening to cascade. The agent must protect the VIP Payment Gateway (`node-0`). `SHED_LOAD` is forbidden on critical nodes (`node-0`, `node-1`, and `node-2`). The agent must coordinate pre-emptive `SCALE_UP` to absorb the surge before it arrives and use persistent `REROUTE_TRAFFIC` to redirect load, all while maintaining cost discipline. The closest analogue to a real high-severity incident: time pressure, safety constraints, and no single correct action.

## Grading (0.0 to 1.0)

Computed by `grader.py`, deterministically and reproducibly:

| Component | Formula | Weight |
|---|---|---|
| Uptime | Fraction of steps with latency ≤ 0.20 and error rate ≤ 0.05 | 0.4 |
| Cost | `exp(-3.0 * over_ratio)`; punishes overprovisioning | 0.4 |
| Stability | `1 / (1 + (avg_energy / TARGET_ENERGY)^power)` | 0.2 |

- task-3: cost contribution disabled when uptime `< 0.5`
- Invalid-action penalty: `-0.05` per invalid action
- Final value clipped at `0.0`
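
Putting the table and the three rules together, the composite score might be computed roughly as follows. This is a sketch, not `grader.py`: the weighted-sum combination, the reading of "cost contribution disabled" as a zeroed cost term, and the `TARGET_ENERGY` and `POWER` values are assumptions; the component formulas, weights, penalty, and lower clip come from the text.

```python
import math

TARGET_ENERGY = 1000.0  # placeholder value, not taken from the repository
POWER = 2.0             # placeholder value, not taken from the repository


def composite_score(uptime, over_ratio, avg_energy, invalid_actions=0, task="task-1"):
    """Combine the three graded components into one episode score (hypothetical helper)."""
    cost = math.exp(-3.0 * over_ratio)
    stability = 1.0 / (1.0 + (avg_energy / TARGET_ENERGY) ** POWER)
    # Assumption: "disabled" means the cost term contributes nothing.
    if task == "task-3" and uptime < 0.5:
        cost = 0.0
    score = 0.4 * uptime + 0.4 * cost + 0.2 * stability
    score -= 0.05 * invalid_actions
    # The final value is clipped at 0.0.
    return max(score, 0.0)


# A perfect episode: full uptime, no overprovisioning, no queue energy.
print(composite_score(uptime=1.0, over_ratio=0.0, avg_energy=0.0))
# The same task-3 episode, graded with and without the uptime gate on cost.
print(composite_score(0.4, 0.0, 0.0, task="task-3"))
print(composite_score(0.4, 0.0, 0.0, task="task-1"))
```

Under these assumptions the task-3 gate removes the incentive to save money by letting the cluster fall over: with uptime below `0.5`, running cheaply earns no cost credit.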

## Observability Stack

AntiAtropos ships a full production-style observability stack:
- Prometheus scrapes environment metrics at `GET /metrics`
- Grafana `antiatropos-overview` dashboard: reward trajectory, queue heatmaps, latency timeseries, SLA violations, per-node state, action throughput, executor reliability
- NGINX reverse proxy exposes `/`, `/prometheus/`, and `/grafana/` on port `7860`
- `deploy/entrypoint.sh` boots the full stack in a single container

## Kubernetes Integration

For teams evaluating agents against real infrastructure:
- `control/kubernetes_executor.py` translates `SCALE_UP`/`SCALE_DOWN` into `kubectl` operations on mapped deployments
- Configure via `ANTIATROPOS_WORKLOAD_MAP` or `ANTIATROPOS_NODE_DEPLOYMENT_MAP`
- `telemetry/prometheus_client.py` ingests live PromQL metrics and reconciles them into simulator state via weighted blending, enabling a real-environment feedback loop with minimal code change

## Baseline Scores

Reproducible NO-OP baseline over 20 seeded runs (100 steps each):

| Task | Mean Composite | Min | Max |
|---|---:|---:|---:|
| task-1 | 0.6980 | 0.6845 | 0.7171 |
| task-2 | 0.7020 | 0.6400 | 0.7560 |
| task-3 | 0.2063 | 0.1721 | 0.2521 |

Task-3's low baseline score reflects the genuine difficulty of burst surge management under safety constraints and the substantial headroom available for capable agents.

## Setup and Usage

### Local Python

```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t antiatropos:latest .
docker run --rm -p 7860:7860 antiatropos:latest
```

### OpenEnv Validation

```bash
openenv validate
```

### Hugging Face Space Deployment

```bash
openenv push
```

## Project Structure

| Path | Description |
|---|---|
| `models.py` | Typed OpenEnv action/observation models |
| `simulator.py` | Queueing physics, task dynamics, action semantics |
| `stability.py` | Lyapunov/reward math |
| `grader.py` | Deterministic episode scoring |
| `inference.py` | OpenAI-compatible baseline runner |
| `client.py` | OpenEnv client wrapper |
| `openenv.yaml` | Environment manifest |
| `server/AntiAtropos_environment.py` | Environment runtime (`reset`, `step`, state handling) |
| `server/app.py` | FastAPI/OpenEnv app + `/metrics` |
| `control/` | Action validation and Kubernetes executor |
| `telemetry/` | Prometheus ingestion, metric mapping, exporter |
| `deploy/` | Entrypoint, NGINX, Prometheus, Grafana provisioning |
| `Dockerfile`, `server/Dockerfile` | Container build targets |

## Reproducibility

- Containerized execution (`Dockerfile`, `server/Dockerfile`)
- Pinned dependency lockfile (`uv.lock`)
- Deterministic grading equations (`grader.py`)
- Explicit reward equations in code (no black-box scoring)
- Configurable environment variables for mode, telemetry endpoints, and policy runtime

For fixed-seed studies, use controlled simulator seeding in evaluation harnesses.

## Evaluation Alignment

| Criterion | AntiAtropos |
|---|---|
| Real-world utility | Genuine SRE/platform engineering control task with production-grade operational constraints |
| Task quality | 3 tasks with easy-medium-hard progression mapped to real incident categories |
| Grader quality | Deterministic, interpretable composite score in `[0, 1]` |
| Environment design | Dense Lyapunov-grounded reward, clean reset/step loop, explicit episode boundaries |
| Code quality | Typed Pydantic models, modular components, OpenEnv manifest, containerized runtime |
| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |