AntiAtropos: Autonomous SRE Control Environment (OpenEnv)
A production-grade RL/agent environment for the future of autonomous DevOps, where intelligent agents replace fragile runbooks, reduce on-call toil, and keep infrastructure healthy without human intervention.
AntiAtropos is an open, high-fidelity environment for training and benchmarking AI agents on site reliability engineering (SRE), the discipline that keeps production infrastructure alive at scale. It models a live five-node microservice cluster operating under realistic production pressures: demand surges, cascading node failures, SLA deadlines, and hard safety constraints on critical services.
This is not a toy grid world or an abstract planning problem. Every action type, every penalty function, and every telemetry field in AntiAtropos was designed to mirror the exact decisions an on-call engineer faces when the PagerDuty alert fires at 3 AM.
The Problem: Infrastructure Operations Don't Scale With Humans
Modern platform teams operate infrastructure that is orders of magnitude more complex than the teams managing it. The result is a well-documented set of pain points:
- On-call toil. Engineers are paged for incidents that follow predictable patterns (traffic spikes, memory pressure, node failures) and execute the same runbooks repeatedly. This is high-stress, low-leverage work that burns out senior engineers.
- Reactive, not proactive. Static autoscaling policies (HPA, VPA) react to thresholds but cannot reason ahead about demand trajectories, reroute traffic away from degrading nodes, or balance cost and reliability over time.
- Runbook rot. Documented procedures go stale. Edge cases accumulate. The institutional knowledge that makes incident response fast lives in engineers' heads, not in systems.
AntiAtropos is a training and evaluation ground for agents that solve this problem: systems that can observe cluster telemetry, reason about multi-step consequences, and issue control actions that keep services healthy, cost-efficient, and resilient.
What AntiAtropos Trains Agents to Do
An agent operating in AntiAtropos executes the same core loop a platform engineer runs continuously:
1. Observe the cluster state. The observation space mirrors real Prometheus/Grafana metrics: request rates, p99 latency, error rates, queue backlogs, CPU utilization, and per-node health, the same signals that drive every serious SRE incident workflow.
2. Reason about what is wrong and why. The environment implements genuine queueing dynamics with boot delays, traffic reroute decay, and Lyapunov-based stability measurement. Agents that only react to threshold breaches perform poorly; agents that build a causal model of the cluster perform well.
3. Issue control actions with real operational semantics.
- `SCALE_UP`: expand node capacity (with a realistic `BOOT_DELAY_TICKS = 5` cold-start delay)
- `SCALE_DOWN`: reduce capacity and cost
- `REROUTE_TRAFFIC`: shift request load away from unhealthy nodes
- `SHED_LOAD`: drop a fraction of traffic to protect the cluster (forbidden on critical nodes)
- `NO_OP`: hold position when the system is stable
These are not abstract symbols. They map directly to `kubectl scale`, traffic policy overrides, and rate limiter controls used in production Kubernetes environments.
4. Balance competing objectives across time. Uptime vs. cost vs. stability is the fundamental trade-off every platform team navigates. Brute-force overprovisioning fails the cost grader. Underprovisioning fails SLAs. The agent must plan, not just react.
5. Respect hard safety constraints. Critical nodes cannot have load shed. Scale operations are bounded. Invalid actions are penalized. AntiAtropos enforces the same guardrails that production runbooks encode, rewarding agents that understand operational boundaries.
Why This Matters for AIOps
The trajectory of platform engineering is clear: the toil layer gets automated, and engineers move up the stack. AntiAtropos provides the training and evaluation infrastructure to accelerate that transition responsibly:
- Benchmark before you deploy. An agent evaluated on AntiAtropos has been tested against capacity ramps, node failures, and burst surges with safety constraints, covering the incident categories that account for the majority of real production pages.
- Dense, informative feedback. Most production telemetry arrives in sparse, high-dimensional streams. AntiAtropos provides step-level Lyapunov-grounded reward signals that give learning algorithms meaningful gradient information at every tick, not just at episode end.
- Composable with real infrastructure. The Kubernetes executor (`control/kubernetes_executor.py`) and Prometheus ingestion (`telemetry/prometheus_client.py`) make it possible to wire a trained policy into a real cluster with minimal adaptation, enabling true hybrid-autonomy workflows where the agent handles routine incidents and escalates novel ones.
- Deterministic grading. Unlike production incidents, where success is hard to measure objectively, AntiAtropos provides a clean `[0.0, 1.0]` composite score per episode, making benchmark comparisons across models and policies reproducible and auditable.
LLM-as-SRE: Zero-Shot Incident Response Evaluation
`inference.py` provides a complete evaluation harness for testing frontier LLMs as zero-shot SRE agents. Set your API key, pick a model, and run; the script handles the full episode loop: observation formatting, action parsing, constraint enforcement, and final grading.
```
set OPENAI_API_KEY=your_key_here
set MODEL_NAME=gpt-4.1
set ANTIATROPOS_TASK=task-3
python inference.py
```

(On Unix shells, use `export` instead of `set`.)
This makes AntiAtropos a drop-in benchmark for comparing how well different LLMs reason about infrastructure operations, a capability that is increasingly relevant as AI models are integrated into on-call tooling, runbook automation, and incident triage systems.
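The harness's core loop can be sketched in a few lines. The sketch below is written against a hypothetical OpenEnv-style client; the method and field names (`reset`, `step`, `.reward`, `.done`, `agent.act`) are illustrative stand-ins, not the exact `inference.py` API:

```python
# Minimal episode loop, written against a HYPOTHETICAL OpenEnv-style client.
# Method and field names here are illustrative; see inference.py for the
# actual observation formatting, action parsing, and grading steps.
def run_episode(env, agent, max_steps=100):
    obs = env.reset()              # initial ClusterObservation
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)    # the LLM chooses an SREAction
        obs = env.step(action)     # observation carries reward/done fields
        total_reward += obs.reward
        if obs.done:
            break
    return total_reward
```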
OpenEnv Specification Compliance
AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
- Action model: `SREAction` in `models.py`
- Observation model: `ClusterObservation` + `NodeObservation` in `models.py`
- `step(action)` returns observation with reward/done fields
- `reset()` returns initial cluster observation
- State is exposed through the OpenEnv `State` object
- `openenv.yaml` is present at repository root
OpenEnv manifest:
```yaml
name: AntiAtropos
runtime: fastapi
app: server.app:app
port: 7860
```
Environment Dynamics
Queueing Model
For each node i, AntiAtropos uses a fluid queue update:
Q_i(t+1) = max(Q_i(t) + lambda_eff_i(t) - mu_i(t), 0)
where:
- `lambda_eff_i = lambda_incoming_i * (1 - shed_fraction_i)`
- `mu_i = capacity_i * 15` requests/tick (or `0` if the node has failed)
Latency and CPU Utilization
- `cpu_i = lambda_incoming_i / mu_i`
- `latency_i = BASE_LATENCY_MS + LATENCY_STEEPNESS * Q_i`
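Taken together, these updates amount to a few lines per node per tick. The sketch below uses the 15 requests/tick service rate stated above; the `BASE_LATENCY_MS` and `LATENCY_STEEPNESS` values are illustrative assumptions, not the repository's actual constants:

```python
BASE_LATENCY_MS = 20.0            # assumed baseline latency (illustrative)
LATENCY_STEEPNESS = 0.5           # assumed latency per queued request (illustrative)
SERVICE_RATE_PER_CAPACITY = 15    # requests/tick per unit of capacity (from the text)

def tick_node(queue, incoming, capacity, shed_fraction, failed=False):
    """Advance one node by one tick of the fluid queue model."""
    lam_eff = incoming * (1.0 - shed_fraction)                    # effective arrivals
    mu = 0.0 if failed else capacity * SERVICE_RATE_PER_CAPACITY  # service rate
    queue = max(queue + lam_eff - mu, 0.0)                        # fluid queue update
    cpu = incoming / mu if mu > 0 else 1.0                        # saturates on failure
    latency = BASE_LATENCY_MS + LATENCY_STEEPNESS * queue
    return queue, cpu, latency
```

Note that `cpu` is computed from the incoming (pre-shed) load, so shedding drains the queue while still surfacing the true demand pressure.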
Lyapunov Stability
The core stability objective is a weighted Lyapunov energy:
V(s) = sum_i (w_i * Q_i^2)
VIP/business-critical nodes carry higher weights w_i. Drift term:
DeltaV(t) = V(s_t) - V(s_{t-1})
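In code, the energy and drift reduce to a weighted sum of squared queue depths (a minimal sketch; the example weights below are illustrative, with the VIP node weighted 3x):

```python
def lyapunov_energy(queues, weights):
    """V(s) = sum_i w_i * Q_i^2, with higher w_i on VIP nodes."""
    return sum(w * q * q for q, w in zip(queues, weights))

# Drift between consecutive states: positive drift means queues are growing.
v_prev = lyapunov_energy([5.0, 2.0], [3.0, 1.0])   # 3*25 + 1*4 = 79.0
v_curr = lyapunov_energy([4.0, 1.0], [3.0, 1.0])   # 3*16 + 1*1 = 49.0
delta_v = v_curr - v_prev                          # -30.0, i.e. stabilizing
```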
Reward Function
R_raw_t = -(alpha * DeltaV_t + beta * Cost_t + gamma * SLA_violation_t)
Default weights: alpha = 0.002, beta = 0.01, gamma = 10.0.
Normalized:
R_norm_t = sigmoid((R_raw_t - midpoint) / temperature)
This is a dense step-level signal, not a sparse terminal reward: it strongly penalizes SLA failures, invalid actions, and destabilizing queue growth.
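The two equations above can be sketched directly, using the default weights from the text; the `midpoint` and `temperature` defaults here are illustrative assumptions:

```python
import math

ALPHA, BETA, GAMMA = 0.002, 0.01, 10.0  # default weights from the text

def raw_reward(delta_v, cost, sla_violation):
    """R_raw = -(alpha * DeltaV + beta * Cost + gamma * SLA_violation)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla_violation)

def normalized_reward(r_raw, midpoint=-5.0, temperature=2.0):
    """Sigmoid squashing into (0, 1); midpoint/temperature are illustrative."""
    return 1.0 / (1.0 + math.exp(-(r_raw - midpoint) / temperature))
```

With `gamma = 10.0`, a single SLA violation swamps typical drift and cost terms, which is what makes the signal punish SLA failures so sharply.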
Action Space
`SREAction` (defined in `models.py`):

- `action_type`: one of `NO_OP` | `SCALE_UP` | `SCALE_DOWN` | `REROUTE_TRAFFIC` | `SHED_LOAD`
- `target_node_id`: `node-0` through `node-4`
- `parameter`: bounded float with action-dependent semantics
Safety constraints enforced by `control/validation.py`:

- `SHED_LOAD` is forbidden on critical nodes (`node-0`, `node-1`, `node-2`)
- Scale operations are bounded by node min/max capacity
- Invalid actions are counted and penalized in the final score
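A dependency-free sketch of these checks (the real implementation lives in `control/validation.py`; the capacity bounds below are illustrative placeholders):

```python
CRITICAL_NODES = {"node-0", "node-1", "node-2"}  # from the task description
MIN_CAPACITY, MAX_CAPACITY = 1, 10               # illustrative bounds

def validate_action(action_type, target_node_id, current_capacity):
    """Return (ok, reason); invalid actions are counted and penalized."""
    if action_type == "SHED_LOAD" and target_node_id in CRITICAL_NODES:
        return False, "SHED_LOAD forbidden on critical node"
    if action_type == "SCALE_UP" and current_capacity >= MAX_CAPACITY:
        return False, "node already at max capacity"
    if action_type == "SCALE_DOWN" and current_capacity <= MIN_CAPACITY:
        return False, "node already at min capacity"
    return True, "ok"
```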
Observation Space
`ClusterObservation`:
- Task/mode/episode step metadata
- Active node count, normalized latency, error rate, backlog
- Cost per hour, Lyapunov energy
- SLA and invalid-action counters
- Raw and normalized reward fields
- Per-node `NodeObservation` list
`NodeObservation` (per node):
- Queue depth, latency, incoming request rate
- CPU utilization, health status
- VIP flag and importance weight
Task Suite
task-1 β Capacity Ramp (Easy)
Load starts near cluster capacity and ramps over the episode. The agent must proactively scale and contain queue growth without overprovisioning. A clean benchmark for predictive capacity planning, the most common form of infrastructure toil.
task-2 β Fault Tolerance (Medium)
A non-VIP node fails at a randomized tick. Traffic continues hitting failed capacity until the agent detects the failure and responds. Tests reactive incident response: detecting failure signals, rerouting affected traffic, and compensating with scaling β under realistic delay constraints.
task-3 β Stability Under Surge (Hard)
Major traffic surges target non-critical nodes, threatening to cascade. The agent must protect the VIP Payment Gateway (node-0). SHED_LOAD is forbidden on critical nodes (node-0, node-1, and node-2). The agent must coordinate pre-emptive SCALE_UP to absorb the surge before it arrives and use persistent REROUTE_TRAFFIC to redirect load, all while maintaining cost discipline. The closest analogue to a real high-severity incident: time pressure, safety constraints, and no single correct action.
Grading (0.0–1.0)
Computed by `grader.py`, deterministic and reproducible:

| Component | Formula | Weight |
|---|---|---|
| Uptime | Fraction of steps with latency ≤ 0.20 and error rate ≤ 0.05 | 0.4 |
| Cost | `exp(-3.0 * over_ratio)`, punishes overprovisioning | 0.4 |
| Stability | `1 / (1 + (avg_energy / TARGET_ENERGY)^power)` | 0.2 |
- task-3: cost contribution disabled when uptime `< 0.5`
- Invalid-action penalty: `-0.05` per invalid action
- Final value clipped at `0.0`
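The composite can be sketched directly from the table; `target_energy` and `power` below are illustrative placeholders for the grader's actual constants, and the task-3 rule (cost zeroed when uptime < 0.5) is omitted for brevity:

```python
import math

def composite_score(uptime_frac, over_ratio, avg_energy, invalid_actions,
                    target_energy=1000.0, power=2.0):
    """Weighted composite per the grading table; constants are illustrative."""
    uptime = uptime_frac                           # fraction of healthy steps
    cost = math.exp(-3.0 * over_ratio)             # punishes overprovisioning
    stability = 1.0 / (1.0 + (avg_energy / target_energy) ** power)
    score = 0.4 * uptime + 0.4 * cost + 0.2 * stability
    score -= 0.05 * invalid_actions                # invalid-action penalty
    return max(score, 0.0)                         # clipped at 0.0
```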
Observability Stack
AntiAtropos ships a full production-style observability stack:
- Prometheus scrapes environment metrics at `GET /metrics`
- Grafana `antiatropos-overview` dashboard: reward trajectory, queue heatmaps, latency timeseries, SLA violations, per-node state, action throughput, executor reliability
- NGINX reverse proxy exposes `/`, `/prometheus/`, and `/grafana/` on port `7860`
- `deploy/entrypoint.sh` boots the full stack in a single container
Kubernetes Integration
For teams evaluating agents against real infrastructure:
- `control/kubernetes_executor.py` translates `SCALE_UP`/`SCALE_DOWN` into `kubectl` operations on mapped deployments
- Configure via `ANTIATROPOS_WORKLOAD_MAP` or `ANTIATROPOS_NODE_DEPLOYMENT_MAP`
- `telemetry/prometheus_client.py` ingests live PromQL metrics and reconciles them into simulator state via weighted blending, enabling a real-environment feedback loop with minimal code change
Baseline Scores
Reproducible NO-OP baseline over 20 seeded runs (100 steps each):
| Task | Mean Composite | Min | Max |
|---|---|---|---|
| task-1 | 0.6980 | 0.6845 | 0.7171 |
| task-2 | 0.7020 | 0.6400 | 0.7560 |
| task-3 | 0.2063 | 0.1721 | 0.2521 |
Task-3's low baseline score reflects the genuine difficulty of burst surge management under safety constraints, and the substantial headroom available for capable agents.
Setup and Usage
Local Python
```
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
Docker
```
docker build -t antiatropos:latest .
docker run --rm -p 7860:7860 antiatropos:latest
```
OpenEnv Validation
```
openenv validate
```
Hugging Face Space Deployment
```
openenv push
```
Project Structure
| Path | Description |
|---|---|
| `models.py` | Typed OpenEnv action/observation models |
| `simulator.py` | Queueing physics, task dynamics, action semantics |
| `stability.py` | Lyapunov/reward math |
| `grader.py` | Deterministic episode scoring |
| `inference.py` | OpenAI-compatible baseline runner |
| `client.py` | OpenEnv client wrapper |
| `openenv.yaml` | Environment manifest |
| `server/AntiAtropos_environment.py` | Environment runtime (reset, step, state handling) |
| `server/app.py` | FastAPI/OpenEnv app + `/metrics` |
| `control/` | Action validation and Kubernetes executor |
| `telemetry/` | Prometheus ingestion, metric mapping, exporter |
| `deploy/` | Entrypoint, NGINX, Prometheus, Grafana provisioning |
| `Dockerfile`, `server/Dockerfile` | Container build targets |
Reproducibility
- Containerized execution (`Dockerfile`, `server/Dockerfile`)
- Pinned dependency lockfile (`uv.lock`)
- Deterministic grading equations (`grader.py`)
- Explicit reward equations in code, no black-box scoring
- Configurable environment variables for mode, telemetry endpoints, and policy runtime
For fixed-seed studies, use controlled simulator seeding in evaluation harnesses.
Evaluation Alignment
| Criterion | AntiAtropos |
|---|---|
| Real-world utility | Genuine SRE/platform engineering control task with production-grade operational constraints |
| Task quality | 3 tasks with easy-medium-hard progression mapped to real incident categories |
| Grader quality | Deterministic, interpretable composite score in [0, 1] |
| Environment design | Dense Lyapunov-grounded reward, clean reset/step loop, explicit episode boundaries |
| Code quality | Typed Pydantic models, modular components, OpenEnv manifest, containerized runtime |
| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |