AntiAtropos: Autonomous SRE Control Environment (OpenEnv)
A production-grade RL/agent environment for the future of autonomous DevOps, where intelligent agents replace fragile runbooks, reduce on-call toil, and keep infrastructure healthy without human intervention.
AntiAtropos is an open, high-fidelity environment for training and benchmarking AI agents on site reliability engineering (SRE), the discipline that keeps production infrastructure alive at scale. It models a live five-node microservice cluster operating under realistic production pressures: demand surges, cascading node failures, SLA deadlines, and hard safety constraints on critical services.
This is not a toy grid world or an abstract planning problem. Every action type, every penalty function, and every telemetry field in AntiAtropos was designed to mirror the exact decisions an on-call engineer faces when the PagerDuty alert fires at 3 AM.
The Problem: Infrastructure Operations Don't Scale With Humans
Modern platform teams operate infrastructure that is orders of magnitude more complex than the teams managing it. The result is a well-documented set of pain points:
- On-call toil. Engineers are paged for incidents that follow predictable patterns (traffic spikes, memory pressure, node failures) and execute the same runbooks repeatedly. This is high-stress, low-leverage work that burns out senior engineers.
- Reactive, not proactive. Static autoscaling policies (HPA, VPA) react to thresholds but cannot reason ahead about demand trajectories, reroute traffic away from degrading nodes, or balance cost and reliability over time.
- Runbook rot. Documented procedures go stale. Edge cases accumulate. The institutional knowledge that makes incident response fast lives in engineers' heads, not in systems.
AntiAtropos is a training and evaluation ground for agents that solve this problem: systems that can observe cluster telemetry, reason about multi-step consequences, and issue control actions that keep services healthy, cost-efficient, and resilient.
What AntiAtropos Trains Agents to Do
An agent operating in AntiAtropos executes the same core loop a platform engineer runs continuously:
1. Observe the cluster state. The observation space mirrors real Prometheus/Grafana metrics: request rates, p99 latency, error rates, queue backlogs, CPU utilization, and per-node health, the same signals that drive every serious SRE incident workflow.
2. Reason about what is wrong and why. The environment implements genuine queueing dynamics with boot delays, traffic reroute decay, and Lyapunov-based stability measurement. Agents that only react to threshold breaches perform poorly; agents that build a causal model of the cluster perform well.
3. Issue control actions with real operational semantics.
- `SCALE_UP`: expand node capacity (with a realistic `BOOT_DELAY_TICKS = 5` cold-start delay)
- `SCALE_DOWN`: reduce capacity and cost
- `REROUTE_TRAFFIC`: shift request load away from unhealthy nodes
- `SHED_LOAD`: drop a fraction of traffic to protect the cluster (forbidden on critical nodes)
- `NO_OP`: hold position when the system is stable
These are not abstract symbols. They map directly to `kubectl scale`, traffic policy overrides, and rate limiter controls used in production Kubernetes environments.
4. Balance competing objectives across time. Uptime vs. cost vs. stability is the fundamental trade-off every platform team navigates. Brute-force overprovisioning fails the cost grader. Underprovisioning fails SLAs. The agent must plan, not just react.
5. Respect hard safety constraints. Critical nodes cannot have load shed. Scale operations are bounded. Invalid actions are penalized. AntiAtropos enforces the same guardrails that production runbooks encode, rewarding agents that understand operational boundaries.
Why This Matters for AIOps
The trajectory of platform engineering is clear: the toil layer gets automated, and engineers move up the stack. AntiAtropos provides the training and evaluation infrastructure to accelerate that transition responsibly:
- Benchmark before you deploy. An agent evaluated on AntiAtropos has been tested against capacity ramps, node failures, and burst surges with safety constraints, covering the incident categories that account for the majority of real production pages.
- Dense, informative feedback. Most production telemetry arrives in sparse, high-dimensional streams. AntiAtropos provides step-level Lyapunov-grounded reward signals that give learning algorithms meaningful gradient information at every tick, not just at episode end.
- Composable with real infrastructure. The Kubernetes executor (`control/kubernetes_executor.py`) and Prometheus ingestion (`telemetry/prometheus_client.py`) make it possible to wire a trained policy into a real cluster with minimal adaptation, enabling true hybrid-autonomy workflows where the agent handles routine incidents and escalates novel ones.
- Deterministic grading. Unlike production incidents, where success is hard to measure objectively, AntiAtropos provides a clean `[0.0, 1.0]` composite score per episode, making benchmark comparisons across models and policies reproducible and auditable.
LLM-as-SRE: Zero-Shot Incident Response Evaluation
`inference.py` provides a complete evaluation harness for testing frontier LLMs as zero-shot SRE agents. Set your API key, pick a model, and run; the script handles the full episode loop: observation formatting, action parsing, constraint enforcement, and final grading.
```
set OPENAI_API_KEY=your_key_here
set MODEL_NAME=gpt-4.1
set ANTIATROPOS_TASK=task-3
python inference.py
```

(On Unix shells, use `export` instead of `set`.)
This makes AntiAtropos a drop-in benchmark for comparing how well different LLMs reason about infrastructure operations, a capability that is increasingly relevant as AI models are integrated into on-call tooling, runbook automation, and incident triage systems.
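The harness's core loop can be sketched in a few lines. The sketch below is written against a hypothetical OpenEnv-style client; the method and field names (`reset`, `step`, `.reward`, `.done`, `agent.act`) are illustrative stand-ins, not the exact `inference.py` API:

```python
# Minimal episode loop, written against a HYPOTHETICAL OpenEnv-style client.
# Method and field names here are illustrative; see inference.py for the
# actual observation formatting, action parsing, and grading steps.
def run_episode(env, agent, max_steps=100):
    obs = env.reset()              # initial ClusterObservation
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)    # the LLM chooses an SREAction
        obs = env.step(action)     # observation carries reward/done fields
        total_reward += obs.reward
        if obs.done:
            break
    return total_reward
```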
OpenEnv Specification Compliance
AntiAtropos implements typed OpenEnv interfaces using Pydantic models and an OpenEnv-compatible FastAPI server:
- Action model: `SREAction` in `models.py`
- Observation model: `ClusterObservation` + `NodeObservation` in `models.py`
- `step(action)` returns observation with reward/done fields
- `reset()` returns initial cluster observation
- State is exposed through the OpenEnv `State` object
- `openenv.yaml` is present at repository root
OpenEnv manifest:
```yaml
name: AntiAtropos
runtime: fastapi
app: server.app:app
port: 7860
```
Environment Dynamics
Queueing Model
For each node i, AntiAtropos uses a fluid queue update:
Q_i(t+1) = max(Q_i(t) + lambda_eff_i(t) - mu_i(t), 0)
where:
- `lambda_eff_i = lambda_incoming_i * (1 - shed_fraction_i)`
- `mu_i = capacity_i * 15` requests/tick (or `0` if the node has failed)
Latency and CPU Utilization
- `cpu_i = lambda_incoming_i / mu_i`
- `latency_i = BASE_LATENCY_MS + LATENCY_STEEPNESS * Q_i`
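Taken together, these updates amount to a few lines per node per tick. The sketch below uses the 15 requests/tick service rate stated above; the `BASE_LATENCY_MS` and `LATENCY_STEEPNESS` values are illustrative assumptions, not the repository's actual constants:

```python
BASE_LATENCY_MS = 20.0            # assumed baseline latency (illustrative)
LATENCY_STEEPNESS = 0.5           # assumed latency per queued request (illustrative)
SERVICE_RATE_PER_CAPACITY = 15    # requests/tick per unit of capacity (from the text)

def tick_node(queue, incoming, capacity, shed_fraction, failed=False):
    """Advance one node by one tick of the fluid queue model."""
    lam_eff = incoming * (1.0 - shed_fraction)                    # effective arrivals
    mu = 0.0 if failed else capacity * SERVICE_RATE_PER_CAPACITY  # service rate
    queue = max(queue + lam_eff - mu, 0.0)                        # fluid queue update
    cpu = incoming / mu if mu > 0 else 1.0                        # saturates on failure
    latency = BASE_LATENCY_MS + LATENCY_STEEPNESS * queue
    return queue, cpu, latency
```

Note that `cpu` is computed from the incoming (pre-shed) load, so shedding drains the queue while still surfacing the true demand pressure.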
Lyapunov Stability
The core stability objective is a weighted Lyapunov energy:
V(s) = sum_i (w_i * Q_i^2)
VIP/business-critical nodes carry higher weights w_i. Drift term:
DeltaV(t) = V(s_t) - V(s_{t-1})
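In code, the energy and drift reduce to a weighted sum of squared queue depths (a minimal sketch; the example weights below are illustrative, with the VIP node weighted 3x):

```python
def lyapunov_energy(queues, weights):
    """V(s) = sum_i w_i * Q_i^2, with higher w_i on VIP nodes."""
    return sum(w * q * q for q, w in zip(queues, weights))

# Drift between consecutive states: positive drift means queues are growing.
v_prev = lyapunov_energy([5.0, 2.0], [3.0, 1.0])   # 3*25 + 1*4 = 79.0
v_curr = lyapunov_energy([4.0, 1.0], [3.0, 1.0])   # 3*16 + 1*1 = 49.0
delta_v = v_curr - v_prev                          # -30.0, i.e. stabilizing
```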
Reward Function
R_raw_t = -(alpha * DeltaV_t + beta * Cost_t + gamma * SLA_violation_t)
Default weights: alpha = 0.002, beta = 0.01, gamma = 10.0.
Normalized:
R_norm_t = sigmoid((R_raw_t - midpoint) / temperature)
This is a dense step-level signal, not a sparse terminal reward: it strongly penalizes SLA failures, invalid actions, and destabilizing queue growth.
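The two equations above can be sketched directly, using the default weights from the text; the `midpoint` and `temperature` defaults here are illustrative assumptions:

```python
import math

ALPHA, BETA, GAMMA = 0.002, 0.01, 10.0  # default weights from the text

def raw_reward(delta_v, cost, sla_violation):
    """R_raw = -(alpha * DeltaV + beta * Cost + gamma * SLA_violation)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla_violation)

def normalized_reward(r_raw, midpoint=-5.0, temperature=2.0):
    """Sigmoid squashing into (0, 1); midpoint/temperature are illustrative."""
    return 1.0 / (1.0 + math.exp(-(r_raw - midpoint) / temperature))
```

With `gamma = 10.0`, a single SLA violation swamps typical drift and cost terms, which is what makes the signal punish SLA failures so sharply.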
Action Space
`SREAction` (defined in `models.py`):

- `action_type`: one of `NO_OP` | `SCALE_UP` | `SCALE_DOWN` | `REROUTE_TRAFFIC` | `SHED_LOAD`
- `target_node_id`: `node-0` through `node-4`
- `parameter`: bounded float with action-dependent semantics
Safety constraints enforced by `control/validation.py`:

- `SHED_LOAD` is forbidden on critical nodes (`node-0`, `node-1`, `node-2`)
- Scale operations are bounded by node min/max capacity
- Invalid actions are counted and penalized in the final score
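A dependency-free sketch of these checks (the real implementation lives in `control/validation.py`; the capacity bounds below are illustrative placeholders):

```python
CRITICAL_NODES = {"node-0", "node-1", "node-2"}  # from the task description
MIN_CAPACITY, MAX_CAPACITY = 1, 10               # illustrative bounds

def validate_action(action_type, target_node_id, current_capacity):
    """Return (ok, reason); invalid actions are counted and penalized."""
    if action_type == "SHED_LOAD" and target_node_id in CRITICAL_NODES:
        return False, "SHED_LOAD forbidden on critical node"
    if action_type == "SCALE_UP" and current_capacity >= MAX_CAPACITY:
        return False, "node already at max capacity"
    if action_type == "SCALE_DOWN" and current_capacity <= MIN_CAPACITY:
        return False, "node already at min capacity"
    return True, "ok"
```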
Observation Space
`ClusterObservation`:
- Task/mode/episode step metadata
- Active node count, normalized latency, error rate, backlog
- Cost per hour, Lyapunov energy
- SLA and invalid-action counters
- Raw and normalized reward fields
- Per-node `NodeObservation` list
`NodeObservation` (per node):
- Queue depth, latency, incoming request rate
- CPU utilization, health status
- VIP flag and importance weight
Task Suite
task-1 β Capacity Ramp (Easy)
Load starts near cluster capacity and ramps over the episode. The agent must proactively scale and contain queue growth without overprovisioning. A clean benchmark for predictive capacity planning, the most common form of infrastructure toil.
task-2 β Fault Tolerance (Medium)
A non-VIP node fails at a randomized tick. Traffic continues hitting failed capacity until the agent detects the failure and responds. Tests reactive incident response: detecting failure signals, rerouting affected traffic, and compensating with scaling β under realistic delay constraints.
task-3 β Stability Under Surge (Hard)
Major traffic surges target non-critical nodes, threatening to cascade. The agent must protect the VIP Payment Gateway (node-0). SHED_LOAD is forbidden on critical nodes (node-0, node-1, and node-2). The agent must coordinate pre-emptive SCALE_UP to absorb the surge before it arrives and use persistent REROUTE_TRAFFIC to redirect load, all while maintaining cost discipline. The closest analogue to a real high-severity incident: time pressure, safety constraints, and no single correct action.
Grading (0.0–1.0)
Computed by `grader.py`, deterministic and reproducible:

| Component | Formula | Weight |
|---|---|---|
| Uptime | Fraction of steps with latency ≤ 0.20 and error rate ≤ 0.05 | 0.4 |
| Cost | `exp(-3.0 * over_ratio)`, punishes overprovisioning | 0.4 |
| Stability | `1 / (1 + (avg_energy / TARGET_ENERGY)^power)` | 0.2 |
- task-3: cost contribution disabled when uptime `< 0.5`
- Invalid-action penalty: `-0.05` per invalid action
- Final value clipped at `0.0`
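The composite can be sketched directly from the table; `target_energy` and `power` below are illustrative placeholders for the grader's actual constants, and the task-3 rule (cost zeroed when uptime < 0.5) is omitted for brevity:

```python
import math

def composite_score(uptime_frac, over_ratio, avg_energy, invalid_actions,
                    target_energy=1000.0, power=2.0):
    """Weighted composite per the grading table; constants are illustrative."""
    uptime = uptime_frac                           # fraction of healthy steps
    cost = math.exp(-3.0 * over_ratio)             # punishes overprovisioning
    stability = 1.0 / (1.0 + (avg_energy / target_energy) ** power)
    score = 0.4 * uptime + 0.4 * cost + 0.2 * stability
    score -= 0.05 * invalid_actions                # invalid-action penalty
    return max(score, 0.0)                         # clipped at 0.0
```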
Observability Stack
AntiAtropos ships a full production-style observability stack:
- Prometheus scrapes environment metrics at `GET /metrics`
- Grafana `antiatropos-overview` dashboard: reward trajectory, queue heatmaps, latency timeseries, SLA violations, per-node state, action throughput, executor reliability
- NGINX reverse proxy exposes `/`, `/prometheus/`, and `/grafana/` on port `7860`
- `deploy/entrypoint.sh` boots the full stack in a single container
Kubernetes Integration
For teams evaluating agents against real infrastructure:
- `control/kubernetes_executor.py` translates `SCALE_UP`/`SCALE_DOWN` into `kubectl` operations on mapped deployments
- Configure via `ANTIATROPOS_WORKLOAD_MAP` or `ANTIATROPOS_NODE_DEPLOYMENT_MAP`
- `telemetry/prometheus_client.py` ingests live PromQL metrics and reconciles them into simulator state via weighted blending, enabling a real-environment feedback loop with minimal code change
Baseline Scores
Reproducible NO-OP baseline over 20 seeded runs (100 steps each):
| Task | Mean Composite | Min | Max |
|---|---|---|---|
| task-1 | 0.6980 | 0.6845 | 0.7171 |
| task-2 | 0.7020 | 0.6400 | 0.7560 |
| task-3 | 0.2063 | 0.1721 | 0.2521 |
Task-3's low baseline score reflects the genuine difficulty of burst surge management under safety constraints, and the substantial headroom available for capable agents.
Setup and Usage
Local Python
```
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
Docker
```
docker build -t antiatropos:latest .
docker run --rm -p 7860:7860 antiatropos:latest
```
OpenEnv Validation
```
openenv validate
```
Hugging Face Space Deployment
```
openenv push
```
Project Structure
| Path | Description |
|---|---|
| `models.py` | Typed OpenEnv action/observation models |
| `simulator.py` | Queueing physics, task dynamics, action semantics |
| `stability.py` | Lyapunov/reward math |
| `grader.py` | Deterministic episode scoring |
| `inference.py` | OpenAI-compatible baseline runner |
| `client.py` | OpenEnv client wrapper |
| `openenv.yaml` | Environment manifest |
| `server/AntiAtropos_environment.py` | Environment runtime (reset, step, state handling) |
| `server/app.py` | FastAPI/OpenEnv app + `/metrics` |
| `control/` | Action validation and Kubernetes executor |
| `telemetry/` | Prometheus ingestion, metric mapping, exporter |
| `deploy/` | Entrypoint, NGINX, Prometheus, Grafana provisioning |
| `Dockerfile`, `server/Dockerfile` | Container build targets |
Reproducibility
- Containerized execution (`Dockerfile`, `server/Dockerfile`)
- Pinned dependency lockfile (`uv.lock`)
- Deterministic grading equations (`grader.py`)
- Explicit reward equations in code, no black-box scoring
- Configurable environment variables for mode, telemetry endpoints, and policy runtime
For fixed-seed studies, use controlled simulator seeding in evaluation harnesses.
Evaluation Alignment
| Criterion | AntiAtropos |
|---|---|
| Real-world utility | Genuine SRE/platform engineering control task with production-grade operational constraints |
| Task quality | 3 tasks with easy-medium-hard progression mapped to real incident categories |
| Grader quality | Deterministic, interpretable composite score in [0, 1] |
| Environment design | Dense Lyapunov-grounded reward, clean reset/step loop, explicit episode boundaries |
| Code quality | Typed Pydantic models, modular components, OpenEnv manifest, containerized runtime |
| Novelty | Lyapunov reward shaping + live K8s control plane + Prometheus telemetry + observability-first design |