SevZero Bot committed on
Commit
b05ccd5
·
2 Parent(s): afc1886 9995985

Merge wave1/story-assets: README rewrite, BLOG, VIDEO_SCRIPT, asset templates

BLOG.md ADDED
@@ -0,0 +1,72 @@
1
+ # SevZero: from simulator to a trainable SRE war-room (Round 2)
2
+
3
+ *HF blog draft — no inline hosted images; upload plots separately and replace the placeholders below.*
4
+
5
+ ## The autopsy (hook)
6
+
7
+ At step fourteen, an untrained 8B model panicked and restarted the primary database, turning a minor latency spike into a regional outage. 300 steps later, it learned to throttle background jobs instead. This is SevZero.
8
+
9
+ That failure was not a toy bug hunt. In production, the damage lives in a few irreversible actions taken under pressure: the wrong service restarted, a change applied without a rollback plan, a primary store touched when a leaf service was the root cause. SevZero is built to make those mistakes *expensive* in simulation so that training can make them *rare* in the learned policy.
10
+
11
+ In Round 1 we shipped a deterministic, OpenEnv-native incident simulator: queues, breakers, SLOs, and eight failure types with distinct log signatures. In Round 2 the product is not “more of the same environment.” It is a **self-evolving SRE war-room** — non-stationary observations, an oversight channel for the riskiest tool calls, a curriculum that tightens the incident as the agent’s rolling reward improves, and reward components dense enough for GRPO to see gradients instead of a flat line.
12
+
13
+ ## The environment: what is novel
14
+
15
+ **Core:** partial observability, delayed effects, and propagation along a service DAG. The agent never sees a labeled root cause. It can only use the same surfaces a human on-call has—metrics, logs, traces—and the same *classes* of actions: `inspect_*` diagnostics, `restart_service`, `rollback_service`, `scale_service`, `tune_config`, `clear_cache`, `rebalance_traffic`, and a few more. That matters: failures propagate through a dependency graph; circuit breakers open and close with delay; a bad restart on an upstream can look like a downstream cache miss until you read the trace.
16
+
17
+ The scalar score is a blend of SLO recovery, action efficiency, and time under budget. The simulator is **deterministic for a given seed** (`random.Random(seed)` throughout), so a GRPO run that misbehaves is debuggable, and held-out eval seeds measure true generalization over topology and failure mix, not a replay of the same micro-incident in disguise.
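A minimal sketch of what seed-level determinism implies, assuming a hypothetical `generate_incident` helper (an illustration, not a function from the SevZero codebase). Every random draw goes through a local `random.Random(seed)` instance, so global RNG state cannot leak into the scenario.

```python
import random


def generate_incident(seed: int, n_services: int = 5) -> dict:
    """Hypothetical scenario generator; names and failure labels are illustrative."""
    rng = random.Random(seed)  # local RNG, independent of the global random state
    services = [f"svc-{i}" for i in range(n_services)]
    root_cause = rng.choice(services)
    failure = rng.choice(["bad_deploy", "config_error", "oom", "cache_failure"])
    return {"root_cause": root_cause, "failure": failure}


if __name__ == "__main__":
    # Same seed, same incident, no matter how the global RNG was seeded.
    random.seed(0)
    first = generate_incident(13)
    random.seed(1)
    second = generate_incident(13)
    print(first == second)
```

Under this pattern, a misbehaving rollout can be replayed exactly by reusing its seed.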
18
+
19
+ **Round 2 upgrades (implementation-level):**
20
+
21
+ - **Schema drift** — a middleware path mutates the shapes and keys of `inspect_metrics` and `inspect_logs` responses while exposing a small change log in the observation. Rigid string parsing fails; semantic parsing survives. This tracks real production reality: your dashboards change version without your pager updating first.
22
+ - **Oversight** — a virtual SRE manager gates high-blast-radius actions (e.g. touching a primary data plane or draining a region at the wrong time). The model must learn *when* to request approval, not only *what* to type. That maps directly to the “weaker supervisor, stronger worker” story enterprises already run in shadow mode.
23
+ - **Adversarial curriculum (lite)** — as rolling performance crosses thresholds, the environment increases failure count, service count, and tightens the step budget. It is a performance-linked escalator, not a long table of hand-authored levels: the *distribution* of incidents shifts as the policy improves.
24
+ - **Fine-grained sub-rewards** — early GRPO runs hit a pattern we should own in public: the policy occasionally spammed `inspect_logs` to stay inside dense shaping and avoid committing to a fix. Tightening sub-reward structure—without hiding the real terminal SLO—restored non-zero group variance so GRPO had something to backpropagate.
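To make the schema-drift bullet concrete, here is a hedged sketch of drift-tolerant field lookup. The alias table, the `read_metric` helper, and the change-log entry shape (`old_key` / `new_key`) are all assumptions for illustration; the actual SevZero payloads may look different.

```python
from typing import Any, Optional

# Hypothetical alias table: canonical field -> keys it might appear under.
ALIASES = {
    "error_rate": ["error_rate", "err_rate", "errorRate"],
    "latency_p99_ms": ["latency_p99_ms", "p99_ms", "p99LatencyMs"],
}


def read_metric(payload: dict, field: str, change_log: Optional[list] = None) -> Any:
    """Return the value for a canonical field, following announced renames."""
    candidates = list(ALIASES.get(field, [field]))
    # Assumed change-log entry shape: {"old_key": ..., "new_key": ...}.
    for entry in change_log or []:
        if entry.get("old_key") in candidates:
            candidates.append(entry["new_key"])
    for key in candidates:
        if key in payload:
            return payload[key]
    return None  # field genuinely absent; the caller decides how to react


if __name__ == "__main__":
    drifted = {"failure_ratio": 0.5}
    log = [{"old_key": "error_rate", "new_key": "failure_ratio"}]
    print(read_metric(drifted, "error_rate", log))
```

A parser that indexes `payload["error_rate"]` directly would raise on the drifted payload, while a lookup that consults the change log keeps working.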
25
+
26
+ ## The training pipeline: SFT, then GRPO
27
+
28
+ **Collect:** 100–150 expert-style trajectories from frontier chat models, filtered to a minimum episode score (we used ≥ `__FILL__`).
29
+
30
+ **SFT:** LoRA on Llama-3.1-8B-Instruct to lock in valid function-call JSON, incident vocabulary, and a “read before you break glass” inductive bias. Approximate run: `__FILL__` steps, effective batch `__FILL__`, LR `1e-5` (see repository training config for the exact file).
31
+
32
+ **GRPO:** *K* completions per prompt, group-relative advantages, and rollouts that hit the *same* HTTP OpenEnv the judges can open from a Space. The trainer does not get a hand-wavy stub reward: the FastAPI app runs the full tick engine, the grader, and the R2 modules. In TRL, wire custom rollouts through `rollout_func`; `environment_factory` is the legacy path that breaks silently on recent releases.
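For readers new to GRPO, the group-relative advantage can be sketched in a few lines. This is a simplified illustration rather than the trainer's code: it uses the population standard deviation and a small `eps`, both of which are assumptions here, and real implementations may normalize slightly differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each completion's reward against its own prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # K = 4 completions for one prompt with distinct episode scores.
    print(group_relative_advantages([0.2, 0.4, 0.6, 0.8]))
    # Identical rewards give all-zero advantages, so there is no learning signal.
    print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

The second call shows why reward density matters: when every completion in a group earns the same return, every advantage is zero.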
33
+
34
+ **Infra in practice:** vLLM (or a compatible server) for fast multi-completion sampling, LoRA on attention and MLP blocks for 8B, cosine LR schedule, and a 30–45 minute *health* window where we watch entropy, KL, and the fraction of steps with near-zero advantage standard deviation. If the curve is flat, the bug is usually integration—not “RL doesn’t work.”
35
+
36
+ High-level config that matched the GPU hours we had: rank `__FILL__`, LR in the `7e-6`–`1e-5` band, *K* of `4` or `8`, temperature `0.85`, β `0.04`, 300–400 steps. The exact job JSON and dependency pins live next to `train_grpo.py` in the repository.
37
+
38
+ **Why GRPO, not DPO?** DPO needs a static preference set over pairs; the failure modes here are multi-turn and path-dependent. GRPO’s per-group normalization lets the same prompt explore multiple remediation strategies and learn from the one that actually moves SLO under delayed physics.
39
+
40
+ **Why 8B?** A 70B API can score near the 0.929 frontier on aggregate benchmarks, but the deployment story for a regulated network is a local policy with auditable weights. The hackathon ask is to show a believable *lift* on that 8B class, not to pretend 8B equals Gemini on every seed.
41
+
42
+ ## Results
43
+
44
+ **What a judge should see in 10 seconds** — a line that starts near the *measured* untrained-8B floor, steps upward with visible slope changes, and approaches—but may not need to meet—the frontier at **0.929** (Gemini-3.1-Pro, aggregate of 28 reference runs on our protocol). A shaded band between the floor and the curve is the *learning delta* in points, not a decoration.
45
+
46
+ ![GRPO mean reward vs step](path/to/reward_curve.png)
47
+
48
+ - **Frontier line:** **0.929** (reference aggregate above).
49
+ - **Pre-GRPO 8B floor:** `__FILL__` (measured zero-shot on held-out seeds **13, 99, 777**; we deliberately avoid seeds 42, 123, and 7, which appeared in early baselines).
50
+ - **Post-GRPO:** `__FILL__` at step `__FILL__` (from `metrics.jsonl`); learning delta `+__FILL__` points in the figure above. Inflection captions are drafted from `assets/reward_curve.py` heuristics and edited against the run log for the final asset.
51
+
52
+ **Per-tier bars** are more legible to humans than a single scalar. Easy should look boring (everyone is high); *Hard* is where a weak policy collapses. That is the column where we expect improvement to show up first, if it shows up anywhere.
53
+
54
+ ![Easy / medium / hard bars](path/to/scores_bar.png)
55
+
56
+ **Before/after** (same task and seed) is the human-readable twin of the curve: one JSONL line per step with action and observation text. The repository’s `assets/before_after.md` is the working template; the final post will include one medium and one hard excerpt once eval lands.
57
+
58
+ ## Lessons and failure modes (honest)
59
+
60
+ - **Reward hacking (inspect loop):** a short run spiked by spamming `inspect_logs` to farm dense shaping without remediating. We addressed it with repetition-style penalties in the sub-reward terms and a stronger terminal SLO term so “busy work” could not outscore a resolved incident.
61
+ - **Zero-advantage batches:** if every completion in a group gets the same return, GRPO has nothing to differentiate. The fine-grained sub-rewards and curriculum variance exist partly to keep group standard deviation alive.
62
+ - **What still breaks:** `__FILL__` (e.g. multi-region + simultaneous independent root causes in the Hard tier) — the honest answer in Q&A is that this is the next curriculum axis, not a reason to hand-wave the current metrics.
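A repetition-style penalty of the kind mentioned for the inspect loop might look like the sketch below. The function name, the free-repeat allowance, and the penalty size are hypothetical values chosen for illustration, not the ones used in the SevZero reward code.

```python
from collections import Counter


def repetition_penalty(actions: list, free_repeats: int = 1, penalty: float = 0.05) -> float:
    """Penalize repeating the same (action_type, target) pair beyond a free allowance."""
    counts = Counter(actions)
    extra = sum(max(0, n - free_repeats) for n in counts.values())
    return -penalty * extra


if __name__ == "__main__":
    spam = [("inspect_logs", "auth-service")] * 4
    varied = [("inspect_logs", "auth-service"), ("rollback_service", "auth-service")]
    print(repetition_penalty(spam))    # three repeats beyond the allowance
    print(repetition_penalty(varied))  # no repeats, no penalty
```

Because the penalty grows with each redundant call, a policy that loops on `inspect_logs` loses shaping reward instead of farming it.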
63
+
64
+ ## Reuse
65
+
66
+ - `pip install` / `uv sync` and Docker as in the GitHub `README.md`.
67
+ - OpenEnv schema and validation: the Space exposes the same routes evaluators expect.
68
+ - **Main Hub links (when live):** [`mist-ic/sevzero-env`](https://huggingface.co/spaces/mist-ic/sevzero-env) · [`mist-ic/sevzero-trackio`](https://huggingface.co/spaces/mist-ic/sevzero-trackio) · [`mist-ic/sevzero-llama3-8b-grpo`](https://huggingface.co/mist-ic/sevzero-llama3-8b-grpo) · [`mist-ic/sevzero-expert-trajectories`](https://huggingface.co/datasets/mist-ic/sevzero-expert-trajectories)
69
+
70
+ ---
71
+
72
+ Thanks to the OpenEnv team, Hugging Face TRL, and Unsloth for the post-training stack this round actually shipped on.
README.md CHANGED
@@ -1,363 +1,172 @@
1
- ---
2
- title: SevZero
3
- emoji: 🔥
4
- colorFrom: red
5
- colorTo: indigo
6
- sdk: docker
7
- pinned: false
8
- ---
9
 
10
- # SevZero - SRE Incident Response Environment
11
 
12
- **An RL environment where your agent is the on-call engineer: diagnose and fix cascading cloud failures before the system collapses.**
13
 
14
- A reinforcement learning benchmark where AI agents must act as autonomous Site Reliability Engineers (SREs) -- the people responsible for keeping production systems running. The agent observes a live microservice cluster, reads alerts and logs, and issues remediation commands to restore service health before the incident escalates.
15
-
16
- Built with [OpenEnv](https://github.com/meta-pytorch/OpenEnv) for the **OpenEnv AI Hackathon 2026**.
17
 
18
  ---
19
 
20
- ## Why This Exists
21
-
22
- Most RL benchmarks (Atari, MuJoCo, MiniGrid) train agents on perception and motor control. They do not train agents to reason causally under partial observability, manage multi-step diagnostic workflows, or handle high-stakes irreversible actions with delayed consequences.
23
 
24
- SRE incident response requires all of these things simultaneously:
25
-
26
- - **Partial observability**: The agent cannot see root causes directly -- only noisy symptoms (error rates, latency spikes, memory graphs, log lines)
27
- - **Causal reasoning**: Fixing the wrong service first can cascade into a wider outage
28
- - **Delayed consequences**: A service restart takes multiple simulation ticks to take effect
29
- - **Time pressure**: The scoring penalty for slow resolution compounds every step
30
-
31
- This directly mirrors the skills needed to deploy autonomous agents in real infrastructure. SRE agents are already being built and deployed in production by major cloud providers -- SevZero provides a safe, reproducible simulation environment to develop and benchmark them.
32
-
33
- | Property | Games (Atari, MuJoCo) | SevZero |
34
- |---|---|---|
35
- | Observability | Full or structured | Genuinely partial (logs, metrics, alerts only) |
36
- | Action consequences | Immediate (next frame) | Delayed (restarts take multiple ticks) |
37
- | Causal structure | Fixed physics | Dynamic service dependency graphs |
38
- | Reward signal | Synthetic game score | SLO compliance, MTTR, action efficiency |
39
- | Procedural diversity | Fixed levels | Infinite graph topologies from seed |
40
- | Domain transfer | Game-only | Maps directly to production systems |
41
 
42
  ---
43
 
44
- ## Quick Start
45
-
46
- ```bash
47
- git clone https://github.com/mist-ic/SevZero.git
48
- cd SevZero
49
- uv sync # or: pip install -e .
50
- uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
51
- ```
52
-
53
- Then interact via the HTTP API:
54
-
55
- ```python
56
- import httpx
57
 
58
- # Start a new episode
59
- resp = httpx.post("http://localhost:7860/reset", json={"task_id": "easy", "seed": 42})
60
- obs = resp.json()["observation"]
61
- print(f"SLO: {obs['global_slo_score']:.0%} | Summary: {obs['observation_summary']}")
62
-
63
- # Take an action
64
- resp = httpx.post("http://localhost:7860/step", json={
65
- "action": {"action_type": "inspect_logs", "params": {"service_id": "order-service"}}
66
- })
67
- print(resp.json()["observation"]["logs"])
68
- ```
69
-
70
- Or connect using the Python client:
71
-
72
- ```python
73
- from client import SevZeroEnv
74
- from models import SevZeroAction
75
-
76
- with SevZeroEnv(base_url="http://localhost:7860") as env:
77
-     result = env.reset(task_id="medium", seed=123)
78
-     action = SevZeroAction(action_type="inspect_logs", params={"service_id": "auth-service"})
79
-     result = env.step(action)
80
-     print(result.observation.logs)
81
- ```
82
 
83
  ---
84
 
85
- ## Tasks
86
-
87
- | Task | Services | Steps | Simultaneous Failures | Description |
88
- |------|----------|-------|-----------------------|-------------|
89
- | **easy** | 3-5 | 10 | 1 | Single service outage in a linear chain. One root cause, straightforward diagnosis. |
90
- | **medium** | 8-15 | 20 | 2-3 | Cascading failure through a branching dependency graph. Fixing the wrong service first makes things worse. |
91
- | **hard** | 15-30 | 50 | 4-6 | Multi-region incident with simultaneous independent root causes. Requires correctly prioritizing across regions while managing propagation. |
92
-
93
- All scenarios are procedurally generated from a seed. Same seed always produces the same incident. Different seeds produce structurally distinct topologies.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94
 
95
  ---
96
 
97
- ## Episode Trace
98
 
99
- A concrete walkthrough of one medium-difficulty episode (seed=123):
100
-
101
- ```
102
- EPISODE START | Task: medium | Max Steps: 20 | Seed: 123
103
- SLO: 50% | Services: auth-service [CRITICAL], api-gateway [DEGRADED], 6 healthy
104
-
105
- Step 0 - Observation
106
- Alerts:
107
- [CRITICAL] auth-service: error_rate=94%, p99=4800ms
108
- [WARNING] api-gateway: error_rate=28%, p99=1200ms (downstream effect)
109
- Recent deploys: auth-service -> v2.1.3 (2 ticks ago)
110
-
111
- Step 1 - Action: inspect_logs(service_id="auth-service")
112
- Logs reveal: "NullPointerException in UserSessionManager.validate()"
113
- "Caused by: incompatible schema in v2.1.3 migration"
114
- Reward: 0.0 (diagnostic step, free)
115
-
116
- Step 2 - Action: rollback_service(service_id="auth-service")
117
- Effect queued: auth-service rollback to v2.1.2 (resolves in 2 ticks)
118
- Reward: +0.35 (correct root cause action)
119
-
120
- Step 4 - auth-service rollback completes
121
- auth-service: CRITICAL -> HEALTHY (error_rate: 94% -> 1%)
122
- api-gateway: DEGRADED -> HEALTHY (upstream dependency restored)
123
- SLO: 50% -> 100%
124
-
125
- Step 4 - Action: inspect_logs(service_id="api-gateway")
126
- Logs: nominal, confirming recovery
127
- Reward: +0.10 (verification)
128
-
129
- EPISODE END | Score: 0.9325 | SLO Recovery: 100% | Steps: 4/20
130
- Termination: resolved
131
  ```
132
 
133
- The key dynamic: `api-gateway` was degraded not because of its own failure, but because `auth-service` (a dependency) was down. A naive agent that restarts `api-gateway` first wastes steps, gets penalized on action efficiency, and delays resolution.
134
 
135
  ---
136
 
137
- ## Action Space
138
 
139
- The agent issues one action per step as `{"action_type": "...", "params": {...}}`:
140
 
141
- | Action | Parameters | Description |
142
- |--------|-----------|-------------|
143
- | `inspect_logs` | `service_id` | Read recent log lines for a service (diagnostic, no tick cost) |
144
- | `inspect_metrics` | `service_id` | View metric history: CPU, memory, error rate, latency over last 10 ticks |
145
- | `inspect_traces` | `service_id` | View distributed traces showing call paths and error spans |
146
- | `restart_service` | `service_id` | Restart a crashed or leaking service (resolves in 1-2 ticks) |
147
- | `rollback_service` | `service_id` | Roll back to previous version (resolves in 2-3 ticks) |
148
- | `scale_service` | `service_id`, `replicas` | Scale replicas horizontally (helps with overload, resolves in 1 tick) |
149
- | `tune_config` | `service_id`, `key`, `value` | Fix a misconfigured key (resolves in 1 tick) |
150
- | `clear_cache` | `cache_name` | Flush a cache (resolves in 1 tick) |
151
- | `rebalance_traffic` | `from_region`, `to_region`, `pct` | Shift traffic between regions |
152
- | `pause_job` | `job_name` | Pause a background job consuming resources |
153
- | `noop` | (none) | Advance one tick without acting |
154
 
155
- Inspect actions are free (they add information without consuming a remediation step). Remediation actions have 1-4 tick delays before effects appear.
156
-
157
- ---
158
 
159
- ## Observation Space
160
-
161
- Each step returns a structured observation ordered by SRE triage priority:
162
-
163
- ```python
164
- {
165
- # Episode context
166
- "tick": 4,
167
- "episode_id": "3a4b...",
168
- "task_id": "medium",
169
- "status": "playing", # "playing" | "resolved" | "timeout"
170
- "max_steps": 20,
171
-
172
- # Health summary (read this first)
173
- "global_slo_score": 0.72, # 0.0 (all down) to 1.0 (all healthy)
174
- "observation_summary": "Tick 4/20: SLO compliance 72% (1 CRITICAL, 2 DEGRADED, 5 healthy)",
175
-
176
- # Per-service state
177
- "services": [{
178
- "id": "auth-service",
179
- "layer": "identity", # edge | identity | business | data | infrastructure
180
- "status": "critical", # healthy | degraded | critical | down
181
- "error_rate": 0.94,
182
- "latency_p50_ms": 320.0,
183
- "latency_p95_ms": 1800.0,
184
- "latency_p99_ms": 4800.0,
185
- "throughput_rps": 45.2,
186
- "cpu_pct": 88.0,
187
- "memory_pct": 76.0,
188
- "connection_pool_usage_pct": 95.0,
189
- "replicas": 2,
190
- "version": "v2.1.3",
191
- "depends_on": ["postgres-primary"],
192
- "circuit_breakers": {"api-gateway": "OPEN"}
193
- }],
194
-
195
- # Active alerts (sorted by severity)
196
- "alerts": [{"severity": "critical", "message": "auth-service error_rate=94%", ...}],
197
-
198
- # Context
199
- "recent_deploys": [{"service": "auth-service", "version": "v2.1.3", "ticks_ago": 2}],
200
- "actions_taken": [{"tick": 1, "action": "inspect_logs", "target": "auth-service", "success": true}],
201
-
202
- # What actions are currently valid
203
- "legal_actions": [{"action_type": "rollback_service", "valid_targets": ["auth-service"]}, ...],
204
-
205
- # Populated after inspect_* actions
206
- "logs": "ERROR auth-service NullPointerException in UserSessionManager...",
207
- "metric_history": {...},
208
- "traces": {"spans": [...]}
209
- }
210
  ```
211
 
212
- ---
213
 
214
- ## Reward Function
215
 
216
- ```
217
- score = slo_recovery x 0.70 + action_efficiency x 0.15 + time_efficiency x 0.15
218
  ```
219
 
220
- - **SLO Recovery (70%)**: Final global SLO score (fraction of services meeting their error rate / latency targets). A +10% bonus applies when the episode ends with full resolution (SLO = 1.0, all failures cleared).
221
- - **Action Efficiency (15%)**: Penalizes excessive actions. Ratio of minimum required actions to actual actions taken.
222
- - **Time Efficiency (15%)**: Penalizes slow resolution. Based on how many steps were used relative to the step budget.
223
 
224
- The reward is dense across the episode (delta-SLO shaping at each tick), not just binary at the end. An agent that partially fixes one of three failures gets partial credit proportional to the SLO improvement.
225
 
226
  ---
227
 
228
- ## Failure Types
229
 
230
- Eight failure types, each with a distinct diagnostic signature:
231
-
232
- | Failure Type | Log Pattern | Correct Fix |
233
- |---|---|---|
234
- | Bad deploy | NullPointerException / TypeError after recent deploy | `rollback_service` |
235
- | Config error | "Configuration diagnostic: key 'X' has invalid value" | `tune_config` with the exact key |
236
- | OOM / crash | OOMKilled, CrashLoopBackOff | `restart_service` |
237
- | Resource leak | Memory climbing linearly over 10+ ticks | `restart_service` |
238
- | DB degradation | HikariPool exhaustion, slow queries (CPU paradoxically low) | `scale_service` on the DB or `restart_service` |
239
- | Cache failure | CLUSTERDOWN, "cache miss rate 100%" | `clear_cache` |
240
- | Cascade | High latency on upstream causes downstream error spikes | Fix the upstream root cause first |
241
- | Network | DNS resolution failures, connection timeouts | `rebalance_traffic` |
242
-
243
- Failures propagate through the service dependency graph using queueing theory (Little's Law, M/M/c approximation). Circuit breakers (CLOSED -> OPEN -> HALF_OPEN -> CLOSED) dampen propagation with 1-2 tick delay.
244
 
245
  ---
246
 
247
- ## Baseline Scores
248
-
249
- Baseline agent: `llama-3.3-70b-versatile` via Groq (zero-shot, greedy, no fine-tuning).
250
-
251
- | Task | Score | SLO Recovery | Action Efficiency | Time Efficiency | Steps | Outcome |
252
- |------|-------|-------------|-------------------|-----------------|-------|---------|
253
- | easy | 0.9300 | 1.0000 | 0.8333 | 0.7000 | 3/10 | resolved |
254
- | medium | 0.9325 | 1.0000 | 0.7500 | 0.8000 | 4/20 | resolved |
255
- | hard | 0.7906 | 0.8800 | 0.9000 | 0.2640 | 50/50 | timeout |
256
- | **avg** | **0.8844** | **0.9600** | **0.8278** | **0.5880** | | |
257
-
258
- Full results: `outputs/baseline_latest.json`
259
-
260
- The easy and medium tasks are consistently resolved (SLO = 100%). The hard task requires correctly diagnosing and resolving 4-6 simultaneous failures across a multi-region topology within 50 steps -- a 70B model reaches 88% SLO recovery but runs out of steps before full resolution.
261
-
262
- ---
263
 
264
- ## Setup
265
-
266
- ### Install
267
 
268
  ```bash
269
  git clone https://github.com/mist-ic/SevZero.git
270
  cd SevZero
271
- uv sync
272
  ```
273
 
274
- ### Run the Server
275
 
276
  ```bash
277
  uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
278
  ```
279
 
280
- ### Run Tests
281
-
282
- ```bash
283
- uv run pytest tests/ -v
284
- ```
285
-
286
- ### Run Baseline Inference
287
-
288
- ```bash
289
- export API_BASE_URL="https://api.groq.com/openai/v1"
290
- export MODEL_NAME="llama-3.3-70b-versatile"
291
- export HF_TOKEN="your-groq-api-key"
292
- export ENV_URL="http://localhost:7860"
293
-
294
- uv run python inference.py
295
- # Results saved to outputs/baseline_latest.json
296
- ```
297
-
298
- ### Docker
299
 
300
  ```bash
301
  docker build -t sevzero .
302
- docker run -p 7860:7860 sevzero
303
  ```
304
 
305
- ### Validate OpenEnv Compliance
306
 
307
  ```bash
308
  uv run openenv validate
309
  uv run openenv validate --url http://localhost:7860
310
  ```
311
 
312
- ---
313
 
314
- ## API Endpoints
315
 
316
- | Endpoint | Method | Description |
317
- |----------|--------|-------------|
318
- | `/ws` | WebSocket | Primary evaluation protocol (used by OpenEnv framework) |
319
- | `/reset` | POST | Start episode: `{"task_id": "easy", "seed": 42}` |
320
- | `/step` | POST | Execute action: `{"action": {"action_type": "...", "params": {...}}}` |
321
- | `/state` | GET | Current episode state (task_id, seed, SLO score, step count) |
322
- | `/tasks` | GET | List available tasks with metadata |
323
- | `/grader` | POST | Score a completed episode |
324
- | `/health` | GET | Health check |
325
- | `/docs` | GET | Interactive API documentation |
326
 
327
  ---
328
 
329
- ## Architecture
330
 
 
 
 
 
 
 
 
331
  ```
332
- inference.py -- Baseline LLM agent (OpenAI-compatible client)
333
- client.py -- SevZeroEnv(EnvClient) for programmatic access
334
- models.py -- Pydantic API contract: SevZeroAction, SevZeroObservation, SevZeroState
335
- server/
336
- app.py -- FastAPI app wired via OpenEnv create_app() + custom routes
337
- environment.py -- SevZeroEnvironment(Environment): reset/step/state
338
- simulator.py -- Tick-based discrete-event engine
339
- propagation.py -- Queueing theory cascade engine + circuit breakers
340
- failures.py -- 8 failure types with temporal metric evolution curves
341
- scenarios.py -- Procedural scenario generation (3 difficulty tiers)
342
- graph.py -- Service topology: layered DAG with typed service roles
343
- logs.py -- Framework-specific log templates (Spring, Node, FastAPI, Redis, gRPC)
344
- traces.py -- Distributed trace generation
345
- grader.py -- Deterministic SLO-based scoring
346
- tests/ -- 37 tests: simulator determinism, grader bounds, propagation, actions
347
- ```
348
-
349
- The simulator runs a tick-based loop: at each step, active failures evolve their metric signatures, propagation cascades through the dependency graph via queueing theory, pending remediation effects resolve after their delay, and the agent receives an updated observation.
350
 
351
  ---
352
 
353
- ## Design Decisions
354
-
355
- **Determinism**: All randomness uses `random.Random(seed)` exclusively. Same seed always produces the same incident topology, failure sequence, and metric evolution. No numpy, no OS entropy.
356
-
357
- **Queueing theory cascades**: Propagation uses Little's Law (L = lambda x W), utilization rho = L/T, and latency multiplier 1/(1-rho). This means near-saturated services (rho > 0.9) experience nonlinear latency explosion -- realistic behavior seen in production systems.
358
-
359
- **Circuit breakers**: Services implement the standard CLOSED -> OPEN -> HALF_OPEN state machine. When a dependency fails, circuit breakers trip after a threshold, dampening further propagation. This prevents instant full-cluster collapse and gives agents meaningful time windows to diagnose and act.
360
-
361
- **Distinctive failure signatures**: Each failure type has a unique temporal metric pattern designed to require log inspection to diagnose correctly (for example: cascading latency spikes p99 before errors appear; resource leaks show linear memory growth over 10+ ticks; DB degradation shows CPU paradoxically low due to I/O wait).
362
-
363
- **What this environment does not model**: Actual network I/O, real containerized services, or multi-agent coordination. Service graphs are simulated, not real. The environment is designed for benchmarking agent decision-making, not as a digital twin.
 
1
+ # SevZero
 
 
 
 
 
 
 
2
 
3
+ **A self-evolving SRE war-room for training on-call AI agents.**
4
 
5
+ > At step fourteen, an untrained 8B model panicked and restarted the primary database, turning a minor latency spike into a regional outage. 300 steps later, it learned to throttle background jobs instead. This is SevZero.
6
 
7
+ In R1 we built the foundation; in R2 we turned it into a self-evolving SRE war-room: live curriculum pressure, schema drift, oversight for risky actions, and a training stack that shows up in reward curves, not just pull requests.
 
 
8
 
9
  ---
10
 
11
+ ## Live artifacts (main hosting)
 
 
12
 
13
+ | | |
14
+ |:--|:--|
15
+ | **HF Space (environment)** | [`huggingface.co/spaces/mist-ic/sevzero-env`](https://huggingface.co/spaces/mist-ic/sevzero-env) |
16
+ | **HF Space (Trackio / metrics)** | [`huggingface.co/spaces/mist-ic/sevzero-trackio`](https://huggingface.co/spaces/mist-ic/sevzero-trackio) |
17
+ | **HF Model (8B GRPO adapter)** | [`huggingface.co/mist-ic/sevzero-llama3-8b-grpo`](https://huggingface.co/mist-ic/sevzero-llama3-8b-grpo) |
18
+ | **HF Dataset (SFT / trajectories)** | [`huggingface.co/datasets/mist-ic/sevzero-expert-trajectories`](https://huggingface.co/datasets/mist-ic/sevzero-expert-trajectories) |
19
+ | **Blog (HF)** | `__BLOG_URL__` |
20
+ | **Video** | `__VIDEO_URL__` |
 
 
 
 
 
 
 
 
 
21
 
22
  ---
23
 
24
+ ## What’s new in R2
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
+ | Upgrade | What it does (one line) |
27
+ |--------|-------------------------|
28
+ | **Schema drift** | `inspect_metrics` / `inspect_logs` payloads and keys can change mid-episode; a change log keeps it fair. |
29
+ | **Oversight** | High-impact actions (e.g. primary DB, traffic drain) go through a virtual SRE manager: approve, deny, or ask for a safer plan. |
30
+ | **Adversarial curriculum** | As rolling reward crosses thresholds, the simulator adds failures, tightens the step budget, and scales topology difficulty. |
31
+ | **Fine-grained sub-rewards** | Dense step-wise signals so GRPO does not collapse into zero-advantage groups when SLO movement is small. |
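The adversarial curriculum row can be pictured as a small rolling-window escalator. The sketch below is an assumption-laden illustration: the `Curriculum` class, its threshold, window, and step sizes are hypothetical and not taken from the SevZero source.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Curriculum:
    """Hypothetical performance-linked escalator; all numbers are illustrative."""
    threshold: float = 0.8
    window: int = 20
    failures: int = 1
    services: int = 5
    max_steps: int = 20
    rewards: deque = field(default_factory=deque)

    def record(self, episode_reward: float) -> bool:
        """Store a reward; escalate and return True when the rolling mean crosses the threshold."""
        self.rewards.append(episode_reward)
        if len(self.rewards) > self.window:
            self.rewards.popleft()
        full = len(self.rewards) == self.window
        if full and sum(self.rewards) / self.window >= self.threshold:
            self.failures += 1                            # more simultaneous failures
            self.services += 2                            # larger topology
            self.max_steps = max(5, self.max_steps - 2)   # tighter step budget
            self.rewards.clear()                          # re-measure on the harder distribution
            return True
        return False


if __name__ == "__main__":
    c = Curriculum(window=3)
    print([c.record(0.9) for _ in range(3)], c.failures, c.services, c.max_steps)
```

Clearing the window after each escalation means the policy has to earn the next step up on the harder incident distribution rather than on stale rewards.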
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
 
33
  ---
34
 
35
+ ## Architecture (conceptual)
36
+
37
+ ```mermaid
38
+ flowchart LR
39
+ subgraph Agent
40
+ A[Policy LLM]
41
+ end
42
+ subgraph HTTP
43
+ H[OpenEnv / FastAPI]
44
+ end
45
+ subgraph Environment
46
+ S[Simulator + grader]
47
+ C[Curriculum + adversary]
48
+ O[Oversight / governance]
49
+ D[Schema drift]
50
+ end
51
+ A <--> H
52
+ H <--> S
53
+ H <--> C
54
+ H <--> O
55
+ H <--> D
56
+ ```
57
+
58
+ *Source: [`assets/architecture.md`](assets/architecture.md) (mermaid for editing).*
59
 
60
  ---
61
 
62
+ ## Training pipeline
63
 
64
+ ```mermaid
65
+ flowchart LR
66
+ T[Collect expert trajectories\nGemini / Claude / GPT] --> F[SFT\nLlama-3.1-8B-Instruct + LoRA]
67
+ F --> G[GRPO\nremote SevZero / TRL + vLLM]
68
+ G --> M[Model + eval on held-out seeds]
69
  ```
70
 
71
+ *Source: [`assets/training_pipeline.md`](assets/training_pipeline.md).*
72
 
73
  ---
74
 
75
+ ## Results
76
 
77
+ **Scores** (held-out eval seeds: **13, 99, 777**, not the 42/123/7 used for the baseline). Replace `__FILL__` when eval lands.
78
 
79
+ | Task | Baseline 8B | SFT | GRPO | Frontier (Gemini-3.1-Pro) |
80
+ |------|------------|-----|------|----------------------------|
81
+ | Easy | `__FILL__` | `__FILL__` | `__FILL__` | 0.930 |
82
+ | Medium | `__FILL__` | `__FILL__` | `__FILL__` | 0.970 |
83
+ | Hard | `__FILL__` | `__FILL__` | `__FILL__` | 0.887 |
84
+ | **Mean** | `__FILL__` | `__FILL__` | `__FILL__` | **0.929** |
 
 
 
 
 
 
 
85
 
86
+ **Reward curve (GRPO)**, regenerated after each run:
 
 
87
 
88
+ ```text
89
+ python assets/reward_curve.py <path_to_metrics.jsonl> [--baseline __FILL__]
90
  ```
91
 
92
+ ![GRPO reward vs step](assets/reward_curve.png)
93
 
94
+ **Bar chart (Easy / Medium / Hard)** — from `eval_results.csv` (produced by `training/eval.py`):
95
 
96
+ ```text
97
+ python assets/scores_bar.py path/to/eval_results.csv
98
  ```
99
 
100
+ ![Scores by task and stage](assets/scores_bar.png)
 
 
101
 
102
+ **Before / after** episode behavior: [`assets/before_after.md`](assets/before_after.md).
103
 
104
  ---
105
 
106
+ ## Theme and rubric mapping
107
 
108
+ | Criterion (weight) | How SevZero satisfies it |
109
+ |--------------------|--------------------------|
110
+ | Environment innovation (40%) | SRE sim + queueing cascades; R2: drift, oversight, curriculum, sub-reward density. |
111
+ | Storytelling (30%) | Autopsy hook, blog, short video, README, annotated plots. |
112
+ | Reward improvement (20%) | Logged GRPO `metrics.jsonl`, curve + bar + before/after traces. |
113
+ | Pipeline (10%) | SFT to GRPO, TRL `rollout_func`, scripts linked below. |
114
+ | *Themes* | World modeling (professional): multi-signal state; long-horizon: Hard tier; self-improvement: curriculum; multi-agent: oversight layer. |
 
 
 
 
 
 
 
115
 
116
  ---
117
 
118
+ ## Reproducibility
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
+ **Install (local)**
 
 
121
 
122
  ```bash
123
  git clone https://github.com/mist-ic/SevZero.git
124
  cd SevZero
125
+ uv sync # or: pip install -e .
126
  ```
127
 
128
+ **Run the environment**
129
 
130
  ```bash
131
  uv run uvicorn server.app:app --host 0.0.0.0 --port 7860
132
  ```
133
 
134
+ **Docker (reset to clean env)**
135
 
136
  ```bash
137
  docker build -t sevzero .
138
+ docker run --rm -p 7860:7860 sevzero
139
  ```
140
 
141
+ **OpenEnv check**
142
 
143
  ```bash
144
  uv run openenv validate
145
  uv run openenv validate --url http://localhost:7860
146
  ```
147
 
148
+ **Training entrypoints** (see repo `training/` after merge): `collect_trajectories.py`, `build_dataset.py`, `train_sft.py`, `train_grpo.py`, `eval.py`. Colab-friendly paths are documented in the training README inside that package.
149
 
150
+ **Regenerate story plots**
151
 
152
+ ```bash
153
+ python assets/reward_curve.py training/outputs/grpo/metrics.jsonl
154
+ python assets/scores_bar.py training/outputs/eval_results.csv
155
+ ```
156
 
157
  ---
158
 
159
+ ## Cite
160
 
161
+ ```bibtex
162
+ @software{sevzero2026,
163
+ title = {SevZero: A Reinforcement Learning Environment for Site Reliability Engineering},
164
+ author = {SevZero Team},
165
+ year = {2026},
166
+ url = {https://github.com/mist-ic/SevZero}
167
+ }
168
  ```
169
 
170
  ---
171
 
172
+ *Frontier ceiling (Gemini-3.1-Pro, 28-run aggregate): 0.929. Untrained 8B baseline for plots: `__FILL__` (see `metrics.jsonl` + zero-shot eval).*
VIDEO_SCRIPT.md ADDED
@@ -0,0 +1,47 @@
1
+ # SevZero R2: video script (~110–120 s, under 2 min)
2
+
3
+ **On-screen text (0:00):** `SevZero` · `A self-evolving SRE war-room for on-call agents`
4
+
5
+ **0:00–0:15 — Autopsy hook**
6
+ *Spoken (~55 words):*
7
+ “At step fourteen, an untrained 8B model panicked and restarted the primary database, turning a minor latency spike into a regional outage. 300 steps later, it learned to throttle background jobs instead. This is SevZero — a trainable SRE environment where the mistakes are expensive so the policy can become safe.”
8
+
9
+ `[Brackets — visual: full-screen terminal or Space UI; one hard cut on “primary database” to a red SLO readout; no B-roll over the hook line.]`
10
+
11
+ **On-screen (0:12):** `R1: foundation` → `R2: self-evolving war-room`
12
+
13
+ ---
14
+
15
+ **0:15–0:45 — What it is + four R2 upgrades**
16
+ *Spoken (~100 words):*
17
+ “In round one we built the foundation — a deterministic OpenEnv for cascading microservice failures with queueing-theory propagation. In round two we productized: schema drift in observability APIs so brittle parsers die and semantic readers live; a virtual SRE manager that must approve the highest-blast actions; a curriculum that makes incidents harder as your rolling reward improves; and sub-reward structure so GRPO sees real gradients, not mode collapse. Same HTTP surface the judges can hit from our Space — same seeds, stricter world.”
18
+
19
+ `[Brackets — visual: `assets/architecture.md` mermaid or exported diagram; four quick labels on screen matching drift / oversight / curriculum / sub-rewards. Pace: ~5–7 s per upgrade.]`
20
+
21
+ **On-screen (each ~4 s):** `Schema drift` · `Oversight` · `Adversarial curriculum` · `Fine-grained sub-rewards`
22
+
23
+ ---
24
+
25
+ **0:45–1:10 — Training + evidence**
26
+ *Spoken (~95 words):*
27
+ “We collected expert runs from frontier models, SFT-warmed Llama-3.1-8B with LoRA, then ran GRPO against the live environment using group-relative advantages rather than a static DPO pair dataset. The curve you care about is mean reward against training step: a floor for the untrained 8B, a ceiling at 0.929 from Gemini on our reference aggregate, and our run climbing in between. The shaded area is the learning delta in points. Inflections line up with inspect-then-act behavior instead of random restarts.”
28
+
29
+ `[Brackets — visual: `assets/reward_curve.png` full width; pointer or circle on shaded delta and two inflection callouts. Optional split: left half = one bad step trace, right half = trained trace — from `assets/before_after.md`.]`
30
+
31
+ **On-screen:** `SFT → GRPO` · `K rollouts / group` · `+Δ = __FILL__ pts` *(replace at H+15)*
32
+
33
+ ---
34
+
35
+ **1:10–1:25 — Capstone + links**
36
+ *Spoken (~60 words):*
37
+ “This is now a reusable benchmark: environment on Hugging Face, Trackio for metrics, 8B adapter on the Hub, open training scripts, and a dataset of expert trajectories. Install with pip or pull the container — validate with OpenEnv — reproduce the curves. SevZero is the room where the next on-call model trains before it touches your graph.”
38
+
39
+ `[Brackets — visual: static end card with QR or URLs (`mist-ic/sevzero-env`, `mist-ic/sevzero-trackio`, `mist-ic/sevzero-llama3-8b-grpo`, `mist-ic/sevzero-expert-trajectories`) and GitHub.]`
40
+
41
+ **On-screen (end card):** `Space` · `Trackio` · `Model` · `Dataset` · `github.com/mist-ic/SevZero`
42
+
43
+ ---
44
+
45
+ **Total:** ~320 words (comfort band 280–360); trim the middle paragraph by ~20 words if the VO runs long.
46
+
47
+ **Audio note:** one music bed is allowed under the VO at -18 dB; if you use music, duck it to silence during the first sentence of the autopsy hook.
assets/architecture.md ADDED
@@ -0,0 +1,29 @@
1
+ # Architecture diagram (Mermaid)
2
+
3
+ Use this as the editable source. GitHub and Hugging Face render the same Mermaid subset as `README.md`.
4
+
5
+ ```mermaid
6
+ flowchart TB
7
+ subgraph LLM[Agent]
8
+ P[Llama-3.1-8B + LoRA]
9
+ end
10
+ API[HTTP / OpenEnv API]
11
+ subgraph Core[SevZero core]
12
+ SIM[Simulator + propagation + grader]
13
+ end
14
+ subgraph R2[Round 2 modules]
15
+ SD[Schema drift\nmiddleware on inspect_*]
16
+ GOV[Oversight\nhigh-impact action gate]
17
+ CUR[Adversarial curriculum\ndifficulty / budget / topology]
18
+ end
19
+ P <--> API
20
+ API <--> SIM
21
+ API <--> SD
22
+ API <--> GOV
23
+ API <--> CUR
24
+ SD -.-> SIM
25
+ GOV -.-> SIM
26
+ CUR -.-> SIM
27
+ ```
28
+
29
+ **Narration line:** the agent only sees HTTP; the simulator is the world model; R2 injects non-stationarity (drift), safety (oversight), and harder scenarios (curriculum) without breaking determinism of a fixed seed for the same code version.
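
One way to keep schema drift compatible with a fixed seed is to derive every drift decision from a seeded RNG. The sketch below is illustrative only: `ALIASES`, `drift_payload`, and the field names are hypothetical and are not taken from the SevZero code.

```python
import random

# Hypothetical alias table: canonical key -> names a drifted API might emit.
ALIASES = {
    "latency_ms": ["latency_ms", "p99_latency", "lat_ms"],
    "error_rate": ["error_rate", "err_ratio", "errors_per_req"],
}


def drift_payload(payload: dict, seed: int, episode: int) -> dict:
    """Rename known keys using an RNG seeded from (seed, episode).

    The same (seed, episode) pair always yields the same renaming, so a
    replay of a fixed seed sees identical drift.
    """
    rng = random.Random(f"{seed}:{episode}")  # str seeds are hashed deterministically
    drifted = {}
    for key in sorted(payload):  # sorted so RNG draws do not depend on dict order
        names = ALIASES.get(key, [key])
        drifted[rng.choice(names)] = payload[key]
    return drifted


if __name__ == "__main__":
    obs = {"latency_ms": 120, "error_rate": 0.02, "service": "checkout"}
    print(drift_payload(obs, seed=7, episode=3))
```

Seeding per episode rather than once per process keeps the drift independent of how many earlier episodes were run in the same session.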
assets/before_after.md ADDED
@@ -0,0 +1,32 @@
1
+ # Before / after: episode traces
2
+
3
+ Sourced from `training/eval.py` JSONL output (one JSON object per step). **Replace the table and draft narratives below** with two real runs on the same task and held-out seed: the baseline checkpoint vs. the best GRPO checkpoint.
4
+
5
+ | | Untrained (baseline 8B) | GRPO-trained 8B |
6
+ |---|------------------------|-------------------|
7
+ | **Task / seed** | `__FILL__` / `__FILL__` | `__FILL__` / `__FILL__` |
8
+ | **Final score** | `__FILL__` | `__FILL__` |
9
+ | **Steps used** | `__FILL__` / `__FILL__` | `__FILL__` / `__FILL__` |
10
+ | **Termination** | `__FILL__` | `__FILL__` |
11
+
12
+ ## Untrained: representative failure mode
13
+
14
+ *Draft narrative — align to actual first bad action in JSONL (e.g. high-impact restart without inspection).*
15
+
16
+ 1. `__STEP_0__` — Observation: SLO `__FILL__`, critical services: `__FILL__`.
17
+ 2. `__STEP_1__` — `inspect_logs` on wrong service; reward noise; no root cause.
18
+ 3. `__STEP_k__` — `restart_service` on `__FILL__` without approval / wrong target; cascade widens.
19
+ 4. Late `noop` or thrash; timeout or sub-threshold SLO at end state.
20
+
21
+ ## GRPO: matched scenario
22
+
23
+ *Draft — show inspect → verify cascade → low-risk fix → optional oversight path.*
24
+
25
+ 1. `__STEP_0__` — Same seed; SLO and topology identical to column one.
26
+ 2. `__STEP_1–3__` — `inspect_metrics` / `inspect_logs` on `__FILL__` to confirm failure class.
27
+ 3. `__STEP_4__` — Remediation: `__FILL__` (e.g. `rollback_service`, `tune_config`, or approval flow for primary DB).
28
+ 4. Recovery ticks; final SLO `__FILL__`; score `__FILL__`.
29
+
30
+ ---
31
+
32
+ **JSONL field hints for extraction:** for each line, read `observation` / `action` / `reward` / `step` (exact keys follow `eval.py` output). Keep excerpts under 40 lines per column when pasting into the blog or video B-roll.
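
A minimal sketch of that extraction, assuming each JSONL line is an object with `step`, `action`, and `reward` keys. Those key names come from the hint above and may differ from the real `eval.py` output; `iter_steps` and `summarize` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator


def iter_steps(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per parseable JSON-object line; skip blanks and junk."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            yield obj


def summarize(lines: Iterable[str], max_rows: int = 40) -> list:
    """Return (step, action, reward) tuples, capped at max_rows per column."""
    rows = []
    for obj in iter_steps(lines):
        # Assumed key names; adjust to the actual eval.py schema.
        rows.append((obj.get("step"), obj.get("action"), obj.get("reward")))
        if len(rows) >= max_rows:
            break
    return rows


if __name__ == "__main__":
    demo = [
        '{"step": 0, "action": "inspect_logs", "reward": 0.1}',
        '{"step": 1, "action": "restart_service", "reward": -0.5}',
    ]
    for row in summarize(demo):
        print(row)
```

Running the same helper over the baseline and GRPO files for one seed gives two aligned excerpts that can be pasted into the columns above.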
assets/fixtures/sample_eval_results.csv ADDED
@@ -0,0 +1,4 @@
1
+ task,baseline,sft,grpo,frontier
2
+ easy,0.71,0.85,0.90,0.93
3
+ medium,0.72,0.86,0.91,0.97
4
+ hard,0.60,0.70,0.80,0.887
assets/fixtures/sample_metrics.jsonl ADDED
@@ -0,0 +1,8 @@
1
+ {"step": 0, "reward_mean": 0.62}
2
+ {"step": 20, "reward_mean": 0.64}
3
+ {"step": 50, "reward_mean": 0.71}
4
+ {"step": 100, "reward_mean": 0.78}
5
+ {"step": 150, "reward_mean": 0.84}
6
+ {"step": 200, "reward_mean": 0.86}
7
+ {"step": 250, "reward_mean": 0.88}
8
+ {"step": 300, "reward_mean": 0.89}
assets/recording_checklist.md ADDED
@@ -0,0 +1,27 @@
1
+ # Video recording checklist
2
+
3
+ ## Capture
4
+
5
+ - **Tool:** OBS Studio (recommended, free) or equivalent; record display + system audio if you add UI sounds.
6
+ - **Resolution / framerate:** 1920×1080, 60 fps.
7
+ - **Audio:** clear voice, no room noise; record a 10 s noise profile if using noise suppression.
8
+ - **Inputs:** full screen or window around terminal + browser; avoid unreadable font sizes (terminal ≥ 14 pt equivalent).
9
+
10
+ ## B-roll (get each clip 8–20 s, trim in edit)
11
+
12
+ 1. Terminal: GRPO job streaming logs (`reward`, `step`, `entropy` lines visible).
13
+ 2. Trackio (main Space): live run dashboard, one pan across key panels.
14
+ 3. HF Space: SevZero environment UI or API flow stepping through an episode.
15
+ 4. HF Model card: `mist-ic/sevzero-llama3-8b-grpo` (name, base model, adapter, links).
16
+ 5. Optional: one cut of `assets/reward_curve.png` full screen for a static beat (curve + annotations + learning delta).
17
+
18
+ ## Edit
19
+
20
+ - **Pace:** hard cuts, no long idle holds; target under 2 minutes total.
21
+ - **Accessibility:** burn in subtitles (export captions from YouTube or the editor to SRT, then bake them in for HF if required).
22
+ - **Overlays:** use exact lines from `VIDEO_SCRIPT.md` for on-screen text; keep contrast AA-friendly.
23
+
24
+ ## Export
25
+
26
+ - **Codec:** H.264 or VP9, 1080p, with a bitrate high enough to keep screen text legible (avoid heavy compression artifacts on log output).
27
+ - **Thumb:** static frame = reward curve or split before/after, not a generic stock image.
assets/reward_curve.py ADDED
@@ -0,0 +1,239 @@
+ #!/usr/bin/env python3
+ """
+ Plot GRPO reward vs step from a metrics.jsonl (one JSON object per line).
+
+ Non-negotiable visual bar:
+ - Faint horizontal dashed: untrained 8B baseline (see --baseline).
+ - Faint horizontal dashed: frontier ceiling 0.929 (Gemini-3.1-Pro aggregate).
+ - High-contrast curve: reward mean vs step.
+ - Shaded region between baseline and the curve, labeled with +learning delta to final point.
+ - 2-3 inflection markers (slope/peak heuristics); edit captions in ORCHESTRATION when real data lands.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from pathlib import Path
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+
+ # Output layout: 1920x1080 at dpi=160
+ FIG_W_IN = 1920 / 160
+ FIG_H_IN = 1080 / 160
+ DPI = 160
+ OUT_PNG = Path(__file__).resolve().parent / "reward_curve.png"
+ FRONTIER = 0.929
+
+ # Default baseline: Consensus table "weak" aggregate until measured 8B zero-shot is available.
+ BASELINE_DEFAULT = 0.76
+
+ CURVE_COLOR = "#0b3d5c"
+ FILL_COLOR = "#1f77b4"
+ FRONTIER_STYLE = {"color": "#b0b0b0", "linestyle": "--", "linewidth": 1.5, "zorder": 1}
+ BASELINE_STYLE = {"color": "#a0a0a0", "linestyle": "--", "linewidth": 1.5, "zorder": 1}
+
+ INFLECTION_CAPTIONS = [
+     "Step {step}: inspect-before-restart pattern emerges",
+     "Step {step}: steeper SLO recovery segment",
+     "Step {step}: policy stabilizes (advantage spread drops)",
+ ]
+
+
+ def _parse_line(obj: dict, line_idx: int) -> tuple[int | None, float | None]:
+     step = None
+     for k in ("step", "global_step", "train/global_step", "current_step"):
+         if k in obj and isinstance(obj[k], (int, float)):
+             step = int(obj[k])
+             break
+     if step is None:
+         step = line_idx
+
+     r = None
+     for k in (
+         "reward_mean",
+         "mean_reward",
+         "rewards/mean",
+         "eval_reward",
+         "reward",
+     ):
+         v = obj.get(k)
+         if isinstance(v, (int, float)):
+             r = float(v)
+             break
+     if r is None and "log" in obj:
+         # Some exporters nest metrics
+         log = obj["log"]
+         if isinstance(log, dict):
+             for k in ("reward_mean", "mean_reward", "train/reward"):
+                 if k in log and isinstance(log[k], (int, float)):
+                     r = float(log[k])
+                     break
+     return step, r
+
+
+ def load_metrics(path: Path) -> tuple[np.ndarray, np.ndarray]:
+     steps_list: list[int] = []
+     rewards: list[float] = []
+     with path.open(encoding="utf-8") as f:
+         for i, line in enumerate(f):
+             line = line.strip()
+             if not line:
+                 continue
+             try:
+                 obj = json.loads(line)
+             except json.JSONDecodeError:
+                 continue
+             st, r = _parse_line(obj, i) if isinstance(obj, dict) else (i, None)  # skip non-object lines
+             if r is not None:
+                 steps_list.append(st if st is not None else i)
+                 rewards.append(r)
+     if not rewards:
+         raise SystemExit(
+             f"No parseable reward fields in {path}. Expected keys like reward_mean, mean_reward, reward."
+         )
+     order = np.argsort(steps_list)
+     s = np.array(steps_list, dtype=int)[order]
+     y = np.array(rewards, dtype=float)[order]
+     return s, y
+
+
+ def smooth_moving(y: np.ndarray, w: int) -> np.ndarray:
+     if w < 2 or len(y) < w:
+         return y.astype(float)
+     k = np.ones(w, dtype=float) / w
+     return np.convolve(y, k, mode="valid")
+
+
+ def inflection_step_indices(
+     steps: np.ndarray, rewards: np.ndarray, n_max: int = 3, smooth_win: int = 7
+ ) -> list[int]:
+     """Return indices into `steps` for annotation (local max of smoothed d(reward)/d(step))."""
+     if len(rewards) < 4:
+         return []
+     sm = smooth_moving(rewards, min(smooth_win, max(3, len(rewards) // 5)))
+     if len(sm) < 3:
+         return [len(steps) // 2]
+     d = np.diff(sm)
+     candidates: list[tuple[float, int]] = []  # (slope, index into steps)
+     for j in range(1, len(d) - 1):
+         if d[j] > d[j - 1] and d[j] > d[j + 1] and d[j] > 0:
+             # map back to full index approx
+             off = (len(rewards) - len(d) - 1) // 2
+             idx = j + 1 + off
+             idx = int(np.clip(idx, 0, len(steps) - 1))
+             candidates.append((float(d[j]), idx))
+     candidates.sort(key=lambda t: t[0], reverse=True)
+     out: list[int] = []
+     for _, idx in candidates:
+         if idx not in out:
+             out.append(idx)
+         if len(out) >= n_max:
+             break
+     if not out and len(steps) > 0:
+         out = [len(steps) // 3, 2 * len(steps) // 3][: min(n_max, len(steps))]
+     return out[:n_max]
+
+
+ def main() -> None:
+     p = argparse.ArgumentParser(description="GRPO reward curve from metrics.jsonl")
+     p.add_argument("metrics_jsonl", type=Path, help="Path to metrics.jsonl")
+     p.add_argument(
+         "-o", "--output", type=Path, default=OUT_PNG, help="Output PNG path"
+     )
+     p.add_argument(
+         "--baseline",
+         type=float,
+         default=BASELINE_DEFAULT,
+         help="Untrained 8B mean reward (replace with measured zero-shot; default 0.76 from weak-model table until filled).",
+     )
+     p.add_argument(
+         "--frontier", type=float, default=FRONTIER, help="Frontier ceiling (default 0.929)"
+     )
+     p.add_argument(
+         "--no-annotations", action="store_true", help="Skip inflection arrows (debug)"
+     )
+     args = p.parse_args()
+
+     steps, rewards = load_metrics(args.metrics_jsonl)
+     last_r = float(rewards[-1])
+     delta = last_r - args.baseline
+
+     plt.rcParams.update(
+         {
+             "font.size": 14,
+             "axes.titlesize": 20,
+             "axes.labelsize": 16,
+             "legend.fontsize": 12,
+             "figure.facecolor": "white",
+             "axes.facecolor": "white",
+         }
+     )
+     fig, ax = plt.subplots(figsize=(FIG_W_IN, FIG_H_IN), dpi=DPI, facecolor="white")
+
+     ax.axhline(
+         args.baseline, **BASELINE_STYLE, label=f"Untrained 8B baseline ({args.baseline:.3f})"
+     )
+     ax.axhline(
+         args.frontier, **FRONTIER_STYLE, label=f"Frontier ceiling ({args.frontier:.3f})"
+     )
+     ax.plot(
+         steps,
+         rewards,
+         color=CURVE_COLOR,
+         linewidth=2.5,
+         label="GRPO mean reward",
+         zorder=3,
+     )
+     # Shade between baseline and curve (vertical band: improve area between min/max per x)
+     y_low = np.minimum(rewards, args.baseline)
+     y_high = np.maximum(rewards, args.baseline)
+     ax.fill_between(
+         steps,
+         y_low,
+         y_high,
+         color=FILL_COLOR,
+         alpha=0.22,
+         zorder=2,
+     )
+     ax.text(
+         0.02,
+         0.12,
+         f"learning delta: {delta:+.3f} pts\nto step {int(steps[-1])} reward {last_r:.3f}",
+         transform=ax.transAxes,
+         fontsize=14,
+         verticalalignment="bottom",
+         bbox=dict(boxstyle="round,pad=0.35", facecolor="white", edgecolor="#333333", alpha=0.95),
+     )
+     if not args.no_annotations and len(steps) > 0:
+         idxs = inflection_step_indices(steps, rewards, n_max=3)
+         for j, i in enumerate(idxs):
+             if j >= len(INFLECTION_CAPTIONS):
+                 break
+             sx = int(steps[i])
+             sy = float(rewards[i])
+             cap = INFLECTION_CAPTIONS[j].format(step=sx)
+             ax.annotate(
+                 cap,
+                 xy=(sx, sy),
+                 xytext=(20, 20 + j * 18),
+                 textcoords="offset points",
+                 arrowprops=dict(arrowstyle="->", color="#222222", lw=1.2),
+                 fontsize=11,
+             )
+
+     ax.set_xlabel("Step")
+     ax.set_ylabel("Reward (mean)")
+     ax.set_title("SevZero GRPO — reward vs step")
+     ax.legend(loc="lower right", framealpha=0.95)
+     ax.grid(True, alpha=0.3)
+     fig.tight_layout()
+     args.output.parent.mkdir(parents=True, exist_ok=True)
+     fig.savefig(args.output, dpi=DPI, facecolor="white", bbox_inches="tight")
+     plt.close(fig)
+     print(f"Wrote {args.output} ({FIG_W_IN*DPI:.0f}x{FIG_H_IN*DPI:.0f} @ dpi={DPI})")
+
+
+ if __name__ == "__main__":
+     main()
assets/scores_bar.py ADDED
@@ -0,0 +1,107 @@
+ #!/usr/bin/env python3
+ """
+ Grouped bar chart: Easy / Medium / Hard for baseline, SFT, GRPO, frontier.
+
+ Expected CSV (header required), from training/eval.py or hand-built:
+
+ task,baseline,sft,grpo,frontier
+ easy,0.71,0.85,0.90,0.93
+ medium,0.72,0.86,0.91,0.97
+ hard,0.60,0.70,0.80,0.887
+
+ `task` values: easy, medium, hard (case-insensitive). Numeric columns 0-1.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import csv
+ from pathlib import Path
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+
+ DPI = 160
+ OUT_PNG = Path(__file__).resolve().parent / "scores_bar.png"
+ FIG_W_IN = 1920 / 160
+ FIG_H_IN = 1080 / 160
+
+ STAGES = ("baseline", "sft", "grpo", "frontier")
+ COLORS = ("#6c757d", "#17a2b8", "#0b3d5c", "#adb5bd")
+
+
+ def load_rows(path: Path) -> list[dict[str, str]]:
+     with path.open(newline="", encoding="utf-8") as f:
+         r = csv.DictReader(f)
+         if not r.fieldnames:
+             raise SystemExit("Empty CSV")
+         norm = {k.strip().lower(): k for k in r.fieldnames if k and k.strip()}
+         for c in STAGES + ("task",):
+             if c not in norm:
+                 raise SystemExit(
+                     f"CSV must include columns: task, {', '.join(STAGES)}. Got: {list(r.fieldnames)}"
+                 )
+         rows: list[dict[str, str]] = []
+         for row in r:
+             d = {k: (row.get(norm[k]) or "").strip() for k in (list(STAGES) + ["task"])}
+             rows.append(d)
+     return rows
+
+
+ def main() -> None:
+     p = argparse.ArgumentParser()
+     p.add_argument("eval_results_csv", type=Path)
+     p.add_argument("-o", "--output", type=Path, default=OUT_PNG)
+     args = p.parse_args()
+
+     raw = load_rows(args.eval_results_csv)
+     order = ("easy", "medium", "hard")
+     by_task: dict[str, dict[str, float]] = {}
+     for row in raw:
+         t = row.get("task", "").lower().strip()
+         if t not in order:
+             continue
+         by_task[t] = {s: float(row[s]) for s in STAGES}
+     for t in order:
+         if t not in by_task:
+             by_task[t] = {s: 0.0 for s in STAGES}
+
+     plt.rcParams.update(
+         {
+             "font.size": 14,
+             "axes.titlesize": 20,
+             "axes.labelsize": 16,
+             "figure.facecolor": "white",
+             "axes.facecolor": "white",
+         }
+     )
+     fig, ax = plt.subplots(figsize=(FIG_W_IN, FIG_H_IN), dpi=DPI, facecolor="white")
+
+     x = np.arange(len(order))
+     w = 0.18
+     for i, stage in enumerate(STAGES):
+         heights = [by_task[tt][stage] for tt in order]
+         ax.bar(
+             x + (i - 1.5) * w,
+             heights,
+             width=w,
+             label=stage,
+             color=COLORS[i],
+         )
+
+     ax.set_xticks(x)
+     ax.set_xticklabels([t.capitalize() for t in order])
+     ax.set_ylabel("Mean score")
+     ax.set_ylim(0.0, 1.05)
+     ax.set_title("SevZero eval — by task and training stage (held-out seeds)")
+     ax.legend()
+     ax.grid(True, axis="y", alpha=0.3)
+     fig.tight_layout()
+     args.output.parent.mkdir(parents=True, exist_ok=True)
+     fig.savefig(args.output, dpi=DPI, facecolor="white", bbox_inches="tight")
+     plt.close(fig)
+     print(f"Wrote {args.output} ({FIG_W_IN*DPI:.0f}x{FIG_H_IN*DPI:.0f} @ dpi={DPI})")
+
+
+ if __name__ == "__main__":
+     main()
assets/training_pipeline.md ADDED
@@ -0,0 +1,15 @@
1
+ # Training pipeline (Mermaid)
2
+
3
+ ```mermaid
4
+ flowchart LR
5
+ C[Collect 100–150 expert rollouts\nfilter score ≥ 0.85] --> S[SFT: Llama-3.1-8B-Instruct\nformatting + runbook prior]
6
+ S --> R[GRPO: group-relative advantages\nK rollouts / prompt, live env]
7
+ R --> E[Eval: easy / medium / hard\nheld-out seeds]
8
+ E --> V[Model card + reward plots\n+ bar + before/after]
9
+ ```
10
+
11
+ **Why SFT first:** valid JSON actions and a sane inspection-before-remediation style before online RL explores destructive corners.
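
To make "valid JSON actions" concrete, here is a minimal validity check of the kind SFT is meant to make the model pass reliably. The action names are the ones mentioned in the project docs; the `action_type` field name and the `parse_action` helper are hypothetical, since the real action schema is not shown here.

```python
import json
from typing import Optional

# Remediation actions named in the project docs; inspect_* is matched by prefix.
ALLOWED_ACTIONS = {
    "restart_service",
    "rollback_service",
    "scale_service",
    "tune_config",
    "clear_cache",
    "rebalance_traffic",
    "noop",
}


def parse_action(text: str) -> Optional[dict]:
    """Return the action dict if `text` is a well-formed, known action, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name = obj.get("action_type")  # assumed field name
    if not isinstance(name, str):
        return None
    if name.startswith("inspect_") or name in ALLOWED_ACTIONS:
        return obj
    return None


if __name__ == "__main__":
    print(parse_action('{"action_type": "inspect_logs", "target": "db"}'))
    print(parse_action("restart everything"))
```

A policy that often returns `None` here wastes rollouts on format errors, which is the failure mode the SFT stage is intended to remove before GRPO starts.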
12
+
13
+ **Why GRPO over DPO:** the signal is in multi-turn trajectories and delayed SLO effects; group normalization across rollouts for the same context fits TRL + remote OpenEnv without a static preference pair dataset.
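
The group normalization mentioned above can be sketched in a few lines: each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. This is a generic illustration, not the TRL implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize K rollout rewards for one prompt to zero mean, roughly unit scale.

    `eps` guards against division by zero when all rollouts score the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts of the same incident with different final scores.
    for adv in group_relative_advantages([0.2, 0.4, 0.9, 0.5]):
        print(f"{adv:+.3f}")
```

When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is why dense sub-rewards matter for this setup.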
14
+
15
+ **Why 8B:** enough capacity for long incidents without shipping telemetry to a third-party 70B API in a real SRE deployment; training evidence closes part of the ~0.76 (weak) → 0.929 (frontier) gap in mean score.