Mist-ic committed on
Commit ec93d50 · 1 Parent(s): 99b8b51

Add comprehensive README with action/observation docs, setup guide, and architecture

Files changed (1)
  1. README.md +157 -3
README.md CHANGED
@@ -1,7 +1,161 @@
1
  # SevZero - SRE Incident Response Environment
2
 
3
- An autonomous on-call SRE managing a microservice cluster undergoing cascading failures.
4
 
5
- Built with [OpenEnv](https://github.com/meta-pytorch/OpenEnv) for the OpenEnv AI Hackathon 2026.
6
 
7
- > Full documentation coming soon.
1
  # SevZero - SRE Incident Response Environment
2
 
3
+ A reinforcement learning environment where AI agents act as autonomous on-call Site Reliability Engineers managing microservice clusters undergoing cascading failures.
4
 
5
+ Built with [OpenEnv](https://github.com/meta-pytorch/OpenEnv) for the **OpenEnv AI Hackathon 2026**.
6
 
7
+ ## Why SRE Incident Response?
8
+
9
+ Incident response is one of the most expensive and error-prone aspects of running production systems. Engineers must rapidly diagnose root causes from noisy signals, contain the blast radius, and restore service health, often under 3 AM pressure. SevZero provides a realistic simulation environment for training and evaluating AI agents on this critical task.
10
+
11
+ The environment models:
12
+ - **Realistic microservice topologies** with typed service layers (edge, identity, business, infrastructure)
13
+ - **Cascading failures** driven by queueing theory (Little's Law, M/M/c approximation, retry amplification)
14
+ - **Circuit breaker state machines** (CLOSED → OPEN → HALF_OPEN → CLOSED)
15
+ - **8 failure types** weighted by real-world incident data (config errors 32%, bad deploys 25%, cascading latency 15%, crashes 10%, resource leaks 8%, DB degradation 5%, cache failures 3%, network errors 2%)
16
+ - **Framework-specific log patterns** from Spring Boot, Node.js, FastAPI, Kubernetes, HikariCP, Redis, and gRPC
17
+
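The circuit breaker cycle listed above can be sketched as a small state machine. This is a minimal illustration, not SevZero's implementation: the README only names the three states and their order, so the `CircuitBreaker` class, its failure threshold, and its cooldown length are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Toy breaker; thresholds are illustrative assumptions, not SevZero's values."""

    failure_threshold: int = 3  # consecutive failures before tripping (assumed)
    cooldown_ticks: int = 2     # ticks spent OPEN before probing (assumed)
    state: str = "CLOSED"
    failures: int = 0
    open_ticks: int = 0

    def record(self, success: bool) -> str:
        """Advance the breaker by one tick given the outcome of a call attempt."""
        if self.state == "OPEN":
            # Calls are rejected while OPEN, so `success` is ignored here;
            # only the cooldown timer advances.
            self.open_ticks += 1
            if self.open_ticks >= self.cooldown_ticks:
                self.state = "HALF_OPEN"
            return self.state
        if self.state == "HALF_OPEN":
            # A single probe decides whether to close or re-open.
            if success:
                self.state, self.failures = "CLOSED", 0
            else:
                self.state, self.open_ticks = "OPEN", 0
            return self.state
        # CLOSED: count consecutive failures and trip at the threshold.
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.state, self.open_ticks = "OPEN", 0
        return self.state
```

Three failed calls trip the breaker, two further ticks move it to `HALF_OPEN`, and one successful probe closes it again.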
18
+ ## Tasks
19
+
20
+ | Task | Services | Steps | Failures | Description |
21
+ |------|----------|-------|----------|-------------|
22
+ | **Easy** | 3–5 | 10 | 1 | Single service outage in a linear chain. Diagnose and fix within 10 steps. |
23
+ | **Medium** | 8–15 | 20 | 2–3 | Cascading failure from shared infrastructure through a branching dependency graph. |
24
+ | **Hard** | 15–30 | 50 | 4–6 | Multiple simultaneous root causes with conflicting mitigations across a complex mesh topology. |
25
+
26
+ All scenarios are procedurally generated from a seed for full determinism.
27
+
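One way to picture seed-driven determinism is weighted sampling from a seeded random generator. The sketch below reuses the failure-type percentages quoted earlier in this README. The identifier names (`config_error`, `bad_deploy`, and so on) and the `sample_failures` helper are hypothetical, and the actual generator in `scenarios.py` may work differently.

```python
import random

# Weights are the percentages from the README; the key names are assumed.
FAILURE_WEIGHTS = {
    "config_error": 32,
    "bad_deploy": 25,
    "cascading_latency": 15,
    "crash": 10,
    "resource_leak": 8,
    "db_degradation": 5,
    "cache_failure": 3,
    "network_error": 2,
}


def sample_failures(seed: int, count: int) -> list[str]:
    """Draw `count` failure types (with replacement) from a seeded generator."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.choices(
        list(FAILURE_WEIGHTS),
        weights=list(FAILURE_WEIGHTS.values()),
        k=count,
    )
```

Calling `sample_failures(42, 3)` twice returns the same list both times, which is the property that makes a seeded scenario reproducible.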
28
+ ## Action Space
29
+
30
+ The agent can issue 11 action types via `{"action_type": "...", "params": {...}}`:
31
+
32
+ | Action | Parameters | Effect |
33
+ |--------|-----------|--------|
34
+ | `inspect_logs` | `service_id` | View recent logs for a service (free action) |
35
+ | `inspect_metrics` | `service_id` | View metric history for a service (free action) |
36
+ | `inspect_traces` | `service_id` | View distributed traces through a service (free action) |
37
+ | `restart_service` | `service_id` | Restart a service (fixes crashes, resource leaks) |
38
+ | `rollback_service` | `service_id` | Roll back to previous version (fixes bad deploys) |
39
+ | `scale_service` | `service_id`, `replicas` | Scale horizontally (helps with load) |
40
+ | `tune_config` | `service_id`, `key`, `value` | Update configuration (fixes config errors) |
41
+ | `clear_cache` | `cache_name` | Flush a cache service |
42
+ | `rebalance_traffic` | `from_region`, `to_region`, `pct` | Shift traffic between regions |
43
+ | `pause_job` | `job_name` | Pause a background job |
44
+ | `noop` | (none) | Do nothing, advance one tick |
45
+
46
+ Remediation actions take effect after a delay of 1–4 ticks. Inspect actions are free (no tick cost beyond the step).
47
+
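As a rough sketch of how a client might assemble these payloads, the helper below checks an action against the parameter names in the table before building the `{"action_type": ..., "params": ...}` dictionary. `ACTION_PARAMS` and `make_action` are hypothetical names for this example. The server's own validation lives in its Pydantic models and may be stricter, for instance about value types.

```python
# Required parameter names per action type, copied from the action table.
ACTION_PARAMS = {
    "inspect_logs": ("service_id",),
    "inspect_metrics": ("service_id",),
    "inspect_traces": ("service_id",),
    "restart_service": ("service_id",),
    "rollback_service": ("service_id",),
    "scale_service": ("service_id", "replicas"),
    "tune_config": ("service_id", "key", "value"),
    "clear_cache": ("cache_name",),
    "rebalance_traffic": ("from_region", "to_region", "pct"),
    "pause_job": ("job_name",),
    "noop": (),
}


def make_action(action_type: str, **params) -> dict:
    """Build an action payload, rejecting unknown types or mismatched params."""
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    expected = set(ACTION_PARAMS[action_type])
    if set(params) != expected:
        raise ValueError(
            f"{action_type} expects params {sorted(expected)}, got {sorted(params)}"
        )
    return {"action_type": action_type, "params": params}
```

For example, `make_action("scale_service", service_id="checkout", replicas=4)` produces a payload ready to serialize as JSON, where `checkout` is a made-up service id.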
48
+ ## Observation Space
49
+
50
+ Observations are ordered by SRE triage priority:
51
+
52
+ - **Episode context**: `tick`, `episode_id`, `task_id`, `status`, `max_steps`
53
+ - **Health summary**: `global_slo_score` (0.0–1.0), `observation_summary`
54
+ - **Per-service state**: `services[]`, each with `id`, `layer`, `status`, `error_rate`, `latency_p50/p95/p99_ms`, `throughput_rps`, `cpu_pct`, `memory_pct`, `connection_pool_usage_pct`, `replicas`, `version`, `depends_on`, `circuit_breakers`
55
+ - **Active alerts**: sorted by severity (`critical` > `warning` > `info`)
56
+ - **Context**: `recent_deploys`, `actions_taken` (history of agent's actions and outcomes)
57
+ - **Action space**: `legal_actions` with valid targets for each action type
58
+ - **Diagnostic output**: `logs`, `metric_history`, `traces` (populated after `inspect_*` actions)
59
+
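A simple agent could condense an observation into a triage summary before choosing an action. The sketch below assumes the observation arrives as a plain dictionary, that alerts sit under an `alerts` key, and that each alert carries a `severity` field. Those two names are guesses, whereas `services`, `id`, `error_rate`, and `global_slo_score` come from the list above.

```python
# Lower rank means more urgent, mirroring critical > warning > info.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def triage(observation: dict) -> dict:
    """Return the SLO score, the service with the highest error rate, and sorted alerts."""
    services = observation.get("services", [])
    alerts = observation.get("alerts", [])  # key name is an assumption
    worst = max(services, key=lambda s: s["error_rate"], default=None)
    ordered = sorted(
        alerts,
        # Unknown severities sort after the three documented levels.
        key=lambda a: SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)),
    )
    return {
        "slo": observation.get("global_slo_score"),
        "worst_service": worst["id"] if worst else None,
        "alerts": ordered,
    }
```

The server already returns alerts sorted by severity, so the re-sort here only guards against a client that has merged alerts from several observations.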
60
+ ## Grading
61
+
62
+ Episodes are scored deterministically on a 0.0–1.0 scale:
63
+
64
+ ```
65
+ score = slo_recovery × 0.70 + action_efficiency × 0.15 + time_efficiency × 0.15
66
+ ```
67
+
68
+ - **SLO Recovery (70%)**: Final global SLO score across all services
69
+ - **Action Efficiency (15%)**: Ratio of effective actions to total actions (penalizes excessive inspection without remediation)
70
+ - **Time Efficiency (15%)**: How quickly the agent resolves the incident relative to the step budget
71
+
72
+ A +10% bonus is applied when the episode terminates with full resolution (all failures remediated, SLO = 1.0).
73
+
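The weighted sum can be written out as a short function. Two points here are assumptions, because the README does not spell them out: the +10% bonus is treated as an additive 0.10, and the result is clamped to 1.0 to stay on the stated scale. The function name `episode_score` is also hypothetical.

```python
def episode_score(
    slo_recovery: float,
    action_efficiency: float,
    time_efficiency: float,
    fully_resolved: bool = False,
) -> float:
    """Combine the three components using the 70/15/15 weights from the README."""
    for name, value in (
        ("slo_recovery", slo_recovery),
        ("action_efficiency", action_efficiency),
        ("time_efficiency", time_efficiency),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0.0, 1.0], got {value}")
    score = 0.70 * slo_recovery + 0.15 * action_efficiency + 0.15 * time_efficiency
    if fully_resolved:
        score += 0.10  # assumption: bonus is additive rather than multiplicative
    return min(score, 1.0)  # assumption: clamp to the documented 0.0–1.0 scale
```

With full SLO recovery but only half marks on both efficiency terms, the unbonused score is 0.70 + 0.075 + 0.075 = 0.85.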
74
+ ## Setup
75
+
76
+ ### Prerequisites
77
+
78
+ - Python 3.11+
79
+ - [uv](https://docs.astral.sh/uv/) (recommended) or pip
80
+
81
+ ### Install
82
+
83
+ ```bash
84
+ git clone https://github.com/mist-ic/SevZero.git
85
+ cd SevZero
86
+ uv sync
87
+ ```
88
+
89
+ ### Run the Server
90
+
91
+ ```bash
92
+ uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
93
+ ```
94
+
95
+ ### Run Tests
96
+
97
+ ```bash
98
+ uv run pytest tests/ -v
99
+ ```
100
+
101
+ ### Run Baseline Inference
102
+
103
+ Requires an LLM API endpoint:
104
+
105
+ ```bash
106
+ export API_BASE_URL="https://router.huggingface.co/v1"
107
+ export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
108
+ export HF_TOKEN="your-token-here"
109
+ export ENV_URL="http://localhost:7860"
110
+
111
+ uv run python inference.py
112
+ ```
113
+
114
+ ### Validate OpenEnv Compliance
115
+
116
+ ```bash
117
+ uv run openenv validate
118
+ ```
119
+
120
+ ### Docker
121
+
122
+ ```bash
123
+ docker build -t sevzero .
124
+ docker run -p 7860:7860 sevzero
125
+ ```
126
+
127
+ ## API Endpoints
128
+
129
+ | Endpoint | Method | Description |
130
+ |----------|--------|-------------|
131
+ | `/ws` | WebSocket | OpenEnv evaluation protocol (primary) |
132
+ | `/health` | GET | Health check |
133
+ | `/reset` | POST | Reset environment with `{"task_id": "easy", "seed": 42}` |
134
+ | `/step` | POST | Execute action with `{"action": {"action_type": "...", "params": {...}}}` |
135
+ | `/state` | GET | Current environment state |
136
+ | `/tasks` | GET | List available tasks |
137
+ | `/grader` | POST | Score an episode |
138
+ | `/docs` | GET | Interactive API documentation |
139
+
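For a quick manual check of the HTTP routes, the standard library is enough. The sketch below builds the `/reset` and `/step` requests with the payload shapes from the table and keeps the network call in a separate `send` helper. `build_post` and `send` are illustrative names, the base URL assumes the server is running locally on port 7860, and the response schema is not documented here, so the return value is handled as generic JSON.

```python
import json
from urllib.request import Request, urlopen

ENV_URL = "http://localhost:7860"  # assumes the local server from "Run the Server"


def build_post(path: str, payload: dict, base_url: str = ENV_URL) -> Request:
    """Create a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: Request) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urlopen(request, timeout=10) as response:
        return json.load(response)


# Payload shapes follow the endpoint table; nothing is sent at import time.
reset_req = build_post("/reset", {"task_id": "easy", "seed": 42})
step_req = build_post("/step", {"action": {"action_type": "noop", "params": {}}})
```

Once the server is up, `send(reset_req)` followed by `send(step_req)` would start an easy episode and advance it by one tick.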
140
+ ## Architecture
141
+
142
+ ```
143
+ inference.py ← Baseline LLM agent (OpenAI client)
144
+ server/
145
+ app.py ← FastAPI app + stateful HTTP routes
146
+ environment.py ← OpenEnv Environment subclass (reset/step/state)
147
+ simulator.py ← Discrete-event simulation engine
148
+ propagation.py ← Queueing theory cascade engine + circuit breakers
149
+ failures.py ← 8 failure types with temporal metric signatures
150
+ scenarios.py ← Procedural scenario generation (3 difficulty tiers)
151
+ graph.py ← Service topology generation
152
+ logs.py ← Framework-specific log templates
153
+ traces.py ← Distributed trace generation
154
+ models.py ← Pydantic API contract (Action, Observation, State)
155
+ ```
156
+
157
+ The simulator runs a tick-based loop: each step, failures evolve their metric signatures, propagation cascades through the dependency graph via queueing theory, pending remediation effects resolve after their delay, and the agent receives an updated observation.
158
+
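The delayed-effect part of that loop can be illustrated with a toy scheduler. This is only a sketch of the idea, not the code in `simulator.py`: `ToySimulator` and `PendingEffect` are hypothetical names, and failure evolution and cascade propagation are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class PendingEffect:
    due_tick: int      # tick at which the remediation takes effect
    service_id: str
    action_type: str


@dataclass
class ToySimulator:
    tick: int = 0
    pending: list[PendingEffect] = field(default_factory=list)
    applied: list[PendingEffect] = field(default_factory=list)

    def schedule(self, service_id: str, action_type: str, delay: int) -> None:
        """Queue a remediation that resolves `delay` ticks from now."""
        self.pending.append(PendingEffect(self.tick + delay, service_id, action_type))

    def step(self) -> list[PendingEffect]:
        """Advance one tick and return the effects that resolved on it."""
        self.tick += 1
        due = [e for e in self.pending if e.due_tick <= self.tick]
        self.pending = [e for e in self.pending if e.due_tick > self.tick]
        self.applied.extend(due)
        return due
```

Scheduling a restart with a delay of 2 leaves the first `step()` empty and resolves the effect on the second, which is why an agent may see no improvement immediately after acting.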
159
+ ## License
160
+
161
+ MIT