SevZero: from simulator to a trainable SRE war-room (Round 2)
HF blog draft — no inline hosted images; upload plots separately and replace the placeholders below.
The autopsy (hook)
At step fourteen, an untrained 8B model panicked and restarted the primary database, turning a minor latency spike into a regional outage. Three hundred training steps later, it had learned to throttle background jobs instead. This is SevZero.
That failure was not a toy bug hunt. In production, the damage lives in a few irreversible actions taken under pressure: the wrong service restarted, a change applied without a rollback plan, a primary store touched when a leaf service was the root cause. SevZero is built to make those mistakes expensive in simulation so that training can make them rare in the learned policy.
In Round 1 we shipped a deterministic, OpenEnv-native incident simulator: queues, breakers, SLOs, and eight failure types with distinct log signatures. In Round 2 the product is not “more of the same environment.” It is a self-evolving SRE war-room — non-stationary observations, an oversight channel for the riskiest tool calls, a curriculum that tightens the incident as the agent’s rolling reward improves, and reward components dense enough for GRPO to see gradients instead of a flat line.
The environment: what is novel
Core: partial observability, delayed effects, and propagation along a service DAG. The agent never sees a labeled root cause. It can only use the same surfaces a human on-call has—metrics, logs, traces—and the same classes of actions: inspect_* diagnostics, restart_service, rollback_service, scale_service, tune_config, clear_cache, rebalance_traffic, and a few more. That matters: failures propagate through a dependency graph; circuit breakers open and close with delay; a bad restart on an upstream can look like a downstream cache miss until you read the trace.
The scalar score is a blend of SLO recovery, action efficiency, and time under budget. The simulator is deterministic for a given seed (random.Random(seed) throughout), so a GRPO run that misbehaves is debuggable, and held-out eval seeds measure true generalization over topology and failure mix, not replay of the same micro-incident in disguise.
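A minimal sketch of the seeding pattern described above. The generate_incident helper and both catalogs are hypothetical placeholders for illustration, not SevZero's actual code; the only point is that a private random.Random(seed) instance, rather than the module-level RNG, keeps an episode reproducible.

```python
import random

# Hypothetical catalogs for illustration; the real simulator defines its own.
SERVICES = ["gateway", "auth", "orders", "cache", "primary-db"]
FAILURE_TYPES = ["memory_leak", "bad_deploy", "cache_stampede", "slow_query"]


def generate_incident(seed: int, n_failures: int = 2) -> list[tuple[str, str]]:
    """Return a reproducible list of (service, failure_type) pairs."""
    rng = random.Random(seed)  # private RNG: unaffected by global random state
    services = rng.sample(SERVICES, k=n_failures)  # distinct services
    return [(svc, rng.choice(FAILURE_TYPES)) for svc in services]


# Same seed, same incident, even if the global RNG is used in between.
first = generate_incident(13)
random.random()
second = generate_incident(13)
print(first == second)  # True
```

Because no call touches the shared module-level generator, replaying a misbehaving rollout only requires the seed that produced it.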
Round 2 upgrades (implementation-level):
- Schema drift — a middleware path mutates the shapes and keys of inspect_metrics and inspect_logs responses while exposing a small change log in the observation. Rigid string parsing fails; semantic parsing survives. This mirrors production reality: your dashboards change version without your pager updating first.
- Oversight — a virtual SRE manager gates high-blast-radius actions (e.g. touching a primary data plane or draining a region at the wrong time). The model must learn when to request approval, not only what to type. That maps directly to the “weaker supervisor, stronger worker” story enterprises already run in shadow mode.
- Adversarial curriculum (lite) — as rolling performance crosses thresholds, the environment increases the failure count and service count and tightens the step budget. It is a performance-linked escalator, not a long table of hand-authored levels: the distribution of incidents shifts as the policy improves.
- Fine-grained sub-rewards — early GRPO runs hit a pattern we should own in public: the policy occasionally spammed inspect_logs to stay inside dense shaping and avoid committing to a fix. Tightening sub-reward structure—without hiding the real terminal SLO—restored non-zero group variance so GRPO had something to backpropagate.
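The performance-linked escalator from the curriculum bullet can be sketched roughly as follows. Everything here is an assumption made for illustration: the class names, the window size, the 0.7 threshold, and the increments are not taken from the SevZero repository.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Difficulty:
    failures: int = 1
    services: int = 5
    step_budget: int = 30


class Curriculum:
    """Escalate difficulty when the rolling mean reward crosses a threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.7):
        self.rewards = deque(maxlen=window)  # rolling window of episode rewards
        self.threshold = threshold
        self.difficulty = Difficulty()

    def record(self, episode_reward: float) -> Difficulty:
        """Log one episode reward and return the difficulty for the next one."""
        self.rewards.append(episode_reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            d = self.difficulty
            self.difficulty = Difficulty(
                failures=d.failures + 1,
                services=d.services + 2,
                step_budget=max(10, d.step_budget - 5),  # tighten, with a floor
            )
            self.rewards.clear()  # re-measure on the harder distribution
        return self.difficulty
```

Clearing the window after each escalation is one possible design choice: it stops rewards earned on the easier distribution from triggering a second jump immediately.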
The training pipeline: SFT, then GRPO
Collect: 100–150 expert-style trajectories from frontier chat models, filtered to a minimum episode score (we used ≥ __FILL__).
SFT: LoRA on Llama-3.1-8B-Instruct to lock in valid function-call JSON, incident vocabulary, and a “read before you break glass” inductive bias. Approximate run: __FILL__ steps, effective batch __FILL__, LR 1e-5 (see repository training config for the exact file).
GRPO: K completions per prompt, group-relative advantages, and rollouts that hit the same HTTP OpenEnv the judges can open from a Space. The trainer does not get a hand-wavy stub reward: the FastAPI app runs the full tick engine, the grader, and the R2 modules. In TRL, wire custom rollouts through rollout_func; environment_factory is the legacy path that breaks silently on recent releases.
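For readers new to GRPO, the "group-relative advantages" in the last step can be illustrated in a few lines. This is a sketch of the general idea, not TRL's implementation: the function name is hypothetical, and the choice of population standard deviation and epsilon is an assumption, since implementations differ on both.

```python
from statistics import fmean, pstdev


def group_relative_advantages(returns: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the K returns sampled for one prompt against their own group."""
    mean = fmean(returns)
    std = pstdev(returns)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in returns]


# A group where every completion scored the same carries no learning signal.
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0, 0.0]
```

With a spread of returns, the completion that recovered the SLO best gets the largest positive advantage, and below-average completions get negative ones.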
Infra in practice: vLLM (or a compatible server) for fast multi-completion sampling, LoRA on attention and MLP blocks for 8B, cosine LR schedule, and a 30–45 minute health window where we watch entropy, KL, and the fraction of steps with near-zero advantage standard deviation. If the curve is flat, the bug is usually integration—not “RL doesn’t work.”
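One of the health signals named above, the fraction of groups with near-zero advantage standard deviation, is cheap to compute from logged returns. The sketch below assumes returns are already grouped per prompt; the function name and the 1e-4 tolerance are illustrative choices, not values from the training config.

```python
from statistics import pstdev


def zero_variance_fraction(groups: list[list[float]], tol: float = 1e-4) -> float:
    """Fraction of prompt groups whose returns are effectively identical."""
    if not groups:
        return 0.0
    flat = sum(1 for g in groups if pstdev(g) < tol)
    return flat / len(groups)


batch = [
    [0.5, 0.5, 0.5, 0.5],  # no learning signal
    [0.2, 0.4, 0.9, 0.5],  # useful spread
    [0.0, 0.0, 0.0, 0.0],  # no learning signal
    [0.1, 0.8, 0.1, 0.8],  # useful spread
]
print(zero_variance_fraction(batch))  # 0.5
```

A value that stays high across the health window suggests the reward is not discriminating between completions, which points at the integration or the reward shaping before it points at the optimizer.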
High-level config that matched the GPU hours we had: rank __FILL__, LR in the 7e-6–1e-5 band, K of 4 or 8, temperature 0.85, β 0.04, 300–400 steps. The exact job JSON and dependency pins live next to train_grpo.py in the repository.
Why GRPO, not DPO? DPO needs a static preference set over pairs; the failure modes here are multi-turn and path-dependent. GRPO’s per-group normalization lets the same prompt explore multiple remediation strategies and learn from the one that actually moves SLO under delayed physics.
Why 8B? A 70B API can score near the 0.929 frontier on aggregate benchmarks, but the deployment story for a regulated network is a local policy with auditable weights. The hackathon ask is to show a believable lift on that 8B class, not to pretend 8B equals Gemini on every seed.
Results
What a judge should see in 10 seconds — a line that starts near the measured untrained-8B floor, steps upward with visible slope changes, and approaches—but may not need to meet—the frontier at 0.929 (Gemini-3.1-Pro, aggregate of 28 reference runs on our protocol). A shaded band between the floor and the curve is the learning delta in points, not a decoration.
- Frontier line: 0.929 (reference aggregate above).
- Pre-GRPO 8B floor: __FILL__ (measured zero-shot on held-out seeds 13, 99, 777 — we deliberately avoid 42/123/7 that appeared in early baselines).
- Post-GRPO: __FILL__ at step __FILL__ (from metrics.jsonl); learning delta +__FILL__ points in the figure above. Inflection captions are drafted from assets/reward_curve.py heuristics and edited against the run log for the final asset.
Per-tier bars are more legible to humans than a single scalar. Easy should look boring (everyone is high); Hard is where a weak policy collapses. That is the column where we expect improvement to show up first, if anywhere.
Before/after (same task and seed) is the human-readable twin of the curve: one JSONL line per step with action and observation text. The repository’s assets/before_after.md is the working template; the final post will include one medium and one hard excerpt once eval lands.
Lessons and failure modes (honest)
- Reward hacking (inspect loop): a short run spiked by spamming inspect_logs to farm dense shaping without remediating. We addressed it with repetition-style penalties in the sub-reward terms and a stronger terminal SLO term so “busy work” could not outscore a resolved incident.
- Zero-advantage batches: if every completion in a group gets the same return, GRPO has nothing to differentiate. The fine-grained sub-rewards and curriculum variance exist partly to keep group standard deviation alive.
- What still breaks: __FILL__ (e.g. multi-region + simultaneous independent root causes in the Hard tier) — the honest answer in Q&A is that this is the next curriculum axis, not a reason to hand-wave the current metrics.
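The post does not spell out the form of the repetition-style penalties, so the following is only one plausible shape, with a hypothetical function name and made-up constants: charge a small cost for each consecutive repeat of the same action beyond a free allowance.

```python
def repetition_penalty(
    actions: list[str], free_repeats: int = 1, penalty: float = 0.05
) -> float:
    """Total penalty for consecutive repeats of one action beyond the allowance."""
    total = 0.0
    run = 0       # length of the current streak of identical actions
    prev = None
    for action in actions:
        run = run + 1 if action == prev else 1
        if run > 1 + free_repeats:  # first call plus the free repeats cost nothing
            total += penalty
        prev = action
    return total


# Interleaving diagnosis with a fix is free; a five-call inspect loop is not.
healthy = ["inspect_logs", "restart_service", "inspect_logs"]
looping = ["inspect_logs"] * 5
print(repetition_penalty(healthy), repetition_penalty(looping) > 0)
```

Subtracting a term like this from the dense shaping, while keeping the terminal SLO term dominant, is the general pattern the first bullet describes; the actual weights would need tuning against the run logs.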
Reuse
- pip install / uv sync and Docker as in the GitHub README.md.
- OpenEnv schema and validation: the Space exposes the same routes evaluators expect.
- Main Hub links (when live): mist-ic/sevzero-env · mist-ic/sevzero-trackio · mist-ic/sevzero-llama3-8b-grpo · mist-ic/sevzero-expert-trajectories
Thanks to the OpenEnv team, Hugging Face TRL, and Unsloth for the post-training stack this round actually shipped on.

