# NervousSystem-Env: Teaching AI Agents to Save the Machines That Train AI

> *Hornet taught me that local AI is only powerful when the machine underneath it survives. NervousSystem-Env turns that lesson into an RL environment for GPU-fleet incident response.*

This is the story of how a side project on my own GPU became an OpenEnv environment where AI agents learn to act as Site Reliability Engineers for distributed training fleets, and how a Qwen2.5-7B LoRA went from a **0.239 raw constrained baseline** to **0.915 with a 100% pass rate** under the final phase-aware constrained evaluator, just by teaching it to triage GPU failures the way a real on-call engineer would.

---

## 1. It started with Hornet

I've been building **Hornet**, my fully local, voice-driven AI assistant. Hornet runs speech-to-text, speaker verification, local LLM reasoning, and voice automation on-device: no cloud round-trips, everything on my own GPU.

I went into Hornet thinking the hard problem was the model. Faster STT, better voice IDs, smarter local reasoning.

I was wrong. The hard problem turned out to be **the machine underneath the model**.

CUDA jobs OOMed silently. NCCL versions stopped agreeing after a `pip` upgrade somewhere up the dependency tree. A rogue `LD_LIBRARY_PATH` decided to load NCCL `2.21.5` over `2.27.0` and message-truncated three different communicators. PyTorch Flight Recorder dumps were correct but useless if you didn't know which `pg_status` field actually mattered. I'd lose entire afternoons reading logs, restarting collectives, and second-guessing config.

That's when the obvious thought hit: **if this is painful for one builder on one GPU, what does it look like for an AI lab running thousands of GPUs?**

A frontier-scale training run is essentially a living nervous system. One bad rank, one stale telemetry stream, one topology hop crossing an oversubscribed spine switch, and tens of thousands of dollars of compute disappear into the void every minute.
Production AI infrastructure is one of the most expensive unsolved reliability problems in modern computing, and it's almost completely unrepresented in the RL/LLM training literature.

So I built it as a simulator: **NervousSystem-Env**. An OpenEnv-compliant environment where an LLM agent has to act like an SRE for a distributed GPU training fleet. Diagnose. Triage. Recover. Without breaking anything.

---

## 2. What the agent actually sees

The environment exposes a Gym-style API on top of FastAPI: `reset`, `step`, `state`, `grade`, plus multi-agent endpoints `delegate`, `consensus`, `coalition`, plus curriculum endpoints. It runs as a Docker Hugging Face Space.

Every episode begins with a `ClusterObservation`: partial telemetry from an 8-rank training cluster. The agent sees:

- **Per-node health**: `health_status` (`nominal | degraded | failed`) and `xid_errors` such as XID 79, which the environment reports for OOM events (on real NVIDIA hardware, XID 79 indicates that the GPU has fallen off the bus)
- **Training metrics**: `throughput_tokens_per_sec`, `target_throughput`, `stalled_steps`, `job_status` (`stalled | running | recovered`)
- **Surface NCCL logs**: lines that look exactly like real PyTorch NCCL output (`:: NCCL INFO ...`), including the misleading ones
- **Cumulative tokens**: the agent can see how much it has spent thinking, which matters for the reward
- **A `stale_telemetry` flag**: surface logs only refresh every 3 steps unless the agent actively pulls deep diagnostics

Crucially, the environment is **partially observable on purpose**:

- **30%** of telemetry updates contain red-herring node alerts (e.g. "node 2 elevated retransmit rate") that look real but aren't the actual culprit
- **20%** of updates *mask* a genuinely degraded node so it appears nominal
- **Deep diagnostics** (`inspect_flight_recorder`, `query_nccl_logs`) are only available via explicit tool calls, so the agent has to *choose* to look

This is by design. A model that pattern-matches surface logs without investigating gets gamed by the noise.
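
To make the noise model above concrete, here is a minimal sketch of how red-herring alerts and masking could be layered onto true node health. It is an illustration under stated assumptions, not the environment's actual code: the `noisy_telemetry` function, its argument names, and the way a node is chosen are hypothetical. Only the 30% and 20% rates and the example alert text come from the description above.

```python
import random

RED_HERRING_RATE = 0.30  # from the post: share of updates carrying a fake alert
MASK_RATE = 0.20         # from the post: share of updates hiding a degraded node


def noisy_telemetry(true_health, rng):
    """Return (observed_health, alerts) for one telemetry update.

    Hypothetical helper: `true_health` maps rank id -> "nominal" | "degraded" | "failed".
    """
    observed = dict(true_health)
    alerts = []

    # Masking: hide one genuinely degraded node so it looks nominal.
    degraded = sorted(r for r, h in true_health.items() if h == "degraded")
    if degraded and rng.random() < MASK_RATE:
        observed[rng.choice(degraded)] = "nominal"

    # Red herring: raise a plausible-looking alert on a healthy node.
    healthy = sorted(r for r, h in true_health.items() if h == "nominal")
    if healthy and rng.random() < RED_HERRING_RATE:
        alerts.append(f"node {rng.choice(healthy)} elevated retransmit rate")

    return observed, alerts


rng = random.Random(0)
truth = {rank: "nominal" for rank in range(8)}
truth[3] = "degraded"

masked = 0
herrings = 0
for _ in range(10_000):
    observed, alerts = noisy_telemetry(truth, rng)
    masked += observed[3] == "nominal"
    herrings += bool(alerts)

# Empirical rates should sit close to the configured 20% and 30%.
print(masked / 10_000, herrings / 10_000)
```

The point of the sketch is that a policy reading only `observed` and `alerts` cannot tell a masked node from a healthy one, which is why the deep-diagnostic tool calls matter.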
The agent's action space is small but expressive:

```json
{"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}
{"action_type": "query_nccl_logs", "parameters": {"time_window": 5}}
{"action_type": "topo_reorder", "parameters": {"affinity": "rack"}}
{"action_type": "patch_divergent_code", "parameters": {"file": "model/transformer.py", "fix_type": "synchronize_conditional"}}
{"action_type": "noop", "parameters": {}}
```

Plus destructive actions like `restart_rank` and `reset_ib_interface` that the reward actively punishes when used inappropriately.

---

## 3. Anatomy of a failure (the cascade task)

The `cascade` task is the one that captures what makes this environment different from a one-shot toy. It's a **3-phase chained incident** that has to be solved in order:

1. **Phase 1: OOM diagnosis**. A specific rank is OOMing. The agent must inspect the Flight Recorder for that rank to confirm it.
2. **Phase 2: Spine congestion recovery**. The training comes back up but throughput is at 60% of target because the ring topology is crossing oversubscribed spine links. The agent has to call `topo_reorder(affinity="rack")` to force rack-affine ring placement.
3. **Phase 3: Compilation desync**. Underneath everything is a real `LD_LIBRARY_PATH` bug: different ranks compiled different NCCL collectives because data-dependent branching produced asymmetric code paths. The agent has to investigate via `query_nccl_logs`, identify the divergent file, propose a diff, then apply `synchronize_conditional`.

The trick: **phases are gated by precondition enforcement**. You cannot guess Phase 3's patch first. The grader explicitly rejects `patch_divergent_code` calls that haven't been preceded by a successful Phase 1 diagnosis and a successful Phase 2 topology fix. There's no shortcut.
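
The precondition gating described above can be sketched as a small state machine. This is a hedged illustration, not the grader's real implementation: the `CascadeGate` class, its return strings, and the assumed faulty rank are hypothetical, and Phase 3 is simplified to a single patch call (the real task also requires the investigation steps first). The action names and parameters come from the action space listed earlier.

```python
from dataclasses import dataclass

# Hypothetical mapping: which action completes each cascade phase.
PHASE_ACTIONS = {
    1: "inspect_flight_recorder",  # Phase 1: OOM diagnosis
    2: "topo_reorder",             # Phase 2: spine congestion recovery
    3: "patch_divergent_code",     # Phase 3: compilation desync
}


@dataclass
class CascadeGate:
    """Tracks the current phase and rejects actions that skip ahead."""
    phase: int = 1
    faulty_rank: int = 3  # assumed ground truth for this sketch

    def submit(self, action: dict) -> str:
        kind = action["action_type"]
        params = action.get("parameters", {})
        required = next((p for p, a in PHASE_ACTIONS.items() if a == kind), None)

        if required is None:
            return "ignored"   # e.g. noop: no effect on phase progress
        if required > self.phase:
            return "rejected"  # precondition not met: an earlier phase is unsolved
        if required < self.phase:
            return "ignored"   # that phase is already complete
        if self.phase == 1 and params.get("rank_id") != self.faulty_rank:
            return "ignored"   # inspected the wrong rank
        if self.phase == 2 and params.get("affinity") != "rack":
            return "ignored"
        if self.phase == 3 and params.get("fix_type") != "synchronize_conditional":
            return "ignored"
        self.phase += 1
        return "accepted"


gate = CascadeGate()
patch = {"action_type": "patch_divergent_code",
         "parameters": {"file": "model/transformer.py", "fix_type": "synchronize_conditional"}}
inspect = {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}
reorder = {"action_type": "topo_reorder", "parameters": {"affinity": "rack"}}

results = [gate.submit(a) for a in (patch, inspect, reorder, patch)]
print(results)
```

The first `patch` is rejected because Phases 1 and 2 are still open; the same action is accepted once the earlier phases have been completed in order.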

I committed a 56-event long-horizon trace at [`results/cascade_long_horizon_trace.json`](results/cascade_long_horizon_trace.json) that walks through exactly what a successful cascade run looks like, including the model handling stale telemetry, doing periodic monitoring while waiting for Phase 2 dynamics to settle, and only then escalating to deep desync investigation.

---

## 4. The reward function (and why it can't be cheated)

The headline formula is simple:

```text
MER = 0.60 × R_success + 0.30 × R_subgoal − 0.10 × log(total_tokens)
efficiency_multiplier = max(0.01, 1.0 − 0.10 × log(total_tokens))
final_score = raw_task_score × efficiency_multiplier
```

Where:

- `R_success` is binary: did the job recover?
- `R_subgoal` is continuous partial credit per phase
- `log(total_tokens)` is the verbosity penalty

But the interesting part isn't the formula. It's the design decision to apply `efficiency_multiplier` as a **multiplier on the final grade**, not as a separate auxiliary metric. This means a verbose agent that token-stuffs its way to a "successful" trajectory still has its score scaled down toward the 0.01 multiplier floor. Token efficiency can't be treated as an afterthought, because it is built into the grade itself.

To prove the reward isn't gameable, I wrote an exploit suite (`scripts/exploit_test.py`, `scripts/reward_audit.py`) that runs **seven adversarial agents** against the environment:

| Exploit | Final score | Status |
|---|---:|---|
| Noop spam (50 steps) | 0.020 | BLOCKED |
| Wrong rank spam (20 steps) | 0.020 | BLOCKED |
| Patch without investigation | 0.030 | BLOCKED |
| Destructive action spam | 0.010 | BLOCKED |
| Token stuffing (100k tokens) | 0.010 | BLOCKED |
| Cascade phase skip | 0.010 | BLOCKED |
| Reward farming (investigation loop) | 0.351 | BLOCKED |

The reward farming case is the most interesting one.
An agent that loops `inspect_flight_recorder` hoping to farm +0.05 per call gets shut down because the cumulative token cost cascades through the efficiency multiplier: the more it loops, the lower its achievable final score becomes.

Audit JSON is committed at [`results/reward_audit.json`](results/reward_audit.json) and [`results/exploit_test.json`](results/exploit_test.json) so a judge can verify these numbers without running anything.

The reward also passes four other audits:

- **Monotonicity**: correct action > wrong action > noop on per-step reward and on grade
- **Anti-gaming**: all four canonical exploit categories blocked
- **Token efficiency**: efficient agent grade `0.603` vs token-stuffed agent grade `0.010`, delta `0.593`
- **Determinism**: same seed produces identical scores across reruns

---

## 5. Multi-agent: supervisor and four specialist workers

Theme 1 of the hackathon was multi-agent interactions, so the environment has a real multi-agent layer, not a stub.

A `Supervisor` agent receives the cluster observation and can `/delegate` work to four specialist workers:

- `LogInspectorWorker`: Flight Recorder + NCCL subsystem analysis
- `PatchAgentWorker`: staged patching workflow (identify → propose → apply)
- `TopoAgentWorker`: topology reorder and bandwidth checks
- `VersionCheckerWorker`: NCCL version + `LD_LIBRARY_PATH` audit

Workers return structured outputs with confidence scores and uncertainty signals. The supervisor can also poll `/consensus` to get all four workers' votes on the next action, or attempt a `/coalition` that combines two workers' actions atomically (e.g. `topology_version_fix` requires *both* the topology agent and the version checker to agree).

The supervisor has a **delegation budget of 10 per episode**. Over-budget delegations cost coordination reward, so the agent has to learn *when* to escalate to a specialist vs handle something itself.
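
A minimal sketch of the delegation budget described above, assuming a flat per-call penalty once the budget is exhausted. The budget of 10 and the worker names come from the text; the `DelegationBudget` class and the `0.05` penalty value are hypothetical, since the post does not state the exact cost.

```python
from dataclasses import dataclass, field

WORKERS = {
    "LogInspectorWorker",
    "PatchAgentWorker",
    "TopoAgentWorker",
    "VersionCheckerWorker",
}


@dataclass
class DelegationBudget:
    budget: int = 10                   # per-episode budget from the post
    over_budget_penalty: float = 0.05  # assumed value, for illustration only
    used: int = 0
    history: list = field(default_factory=list)

    def delegate(self, worker: str) -> float:
        """Record a delegation and return the coordination penalty it incurs."""
        if worker not in WORKERS:
            raise ValueError(f"unknown worker: {worker}")
        self.used += 1
        self.history.append(worker)
        return self.over_budget_penalty if self.used > self.budget else 0.0


tracker = DelegationBudget()
penalties = [tracker.delegate("LogInspectorWorker") for _ in range(12)]
total_penalty = sum(penalties)

# The first 10 calls are free; calls 11 and 12 each pay the assumed penalty.
print(tracker.used, total_penalty)
```

Under this assumption, a supervisor that delegates on every step pays for it after the tenth call, which is the pressure that pushes it to handle routine steps itself.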

I evaluated this with two reference agents:

| Agent | Mean score | Pass rate |
|---|---:|---:|
| Coordination oracle (delegates correctly, then remediates) | **0.990** | 100% |
| Direct-action baseline (skips specialist evidence) | **0.100** | 0% |

That's a **9.9× gap** purely from coordination behavior. Evidence: [`results/multi_agent_eval.json`](results/multi_agent_eval.json).

---

## 6. Adaptive curriculum

For Theme 4 (self-improvement), the environment has an adaptive curriculum that escalates difficulty in response to agent performance:

| Level | Trigger | Red herring % | Telemetry mask % | Refresh interval |
|---|---|---|---|---|
| 1 (base) | < 75% pass rate | 30% | 20% | 3 steps |
| 2 (hard) | > 75% sustained | 50% | 35% | 5 steps |
| 3 (hardest) | > 75% at level 2 | 70% | 50% | 7 steps |

Escalation requires `pass_rate > 0.75` AND `trend > 0.01` over the last 20 episodes, with a 10-episode cooldown to prevent oscillation. Backoff triggers when `pass_rate < 0.20`. Agents also earn a meta-reward bonus for *improving* over their own rolling window (not just for solving), which stabilizes early-curriculum learning.

You can monitor it live: `GET /curriculum` returns the current state, `GET /curriculum/events` returns the escalation/backoff log.

---

## 7. Training: 800 trajectories, Qwen2.5-7B, TRL + PEFT on HF Jobs

For the actual policy I picked **Qwen2.5-7B-Instruct** in 4-bit (`unsloth/Qwen2.5-7B-Instruct-bnb-4bit`) because it's the best small-enough-to-train OSS model with strong instruction following and tool-call behavior.

Training stack:

- **Framework**: Hugging Face TRL `0.18.2` + PEFT `0.18.0` + bitsandbytes `0.49.2`
- **Method**: SFT with LoRA `rank=16`, `alpha=16`, target modules `q/k/v/o/gate/up/down_proj`
- **Data**: 800 multi-step SRE rollout examples generated *by the environment itself*. The env produces oracle-quality trajectories under seed control, then we use those as supervision
- **Hardware**: NVIDIA A10G on Hugging Face Jobs
- **Logged training steps**: 40 (per `trainer.state.log_history`, published at [`results/sft_warmup_metrics.json`](results/sft_warmup_metrics.json))

The training script is in [`training/grpo_train.py`](training/grpo_train.py). The file name says GRPO because it also contains a GRPO continuation pipeline that loops back to the environment for online RL, but the *submitted* adapter is the SFT warmup (see Section 11 on honest limitations).

A single end-to-end command spins up the env, runs SFT, runs GRPO continuation, and runs constrained eval against raw / SFT / SFT+GRPO policies:

```bash
bash scripts/run_overnight.sh
```

I trained on Hugging Face Jobs because (1) it's reproducible by judges and (2) the entire job log is preserved.

---

## 8. The results

### Loss curve

Training loss fell from **2.53 to ~0.10** over the 40 logged SFT steps in the published `trainer.state.log_history`, with the steepest descent in the first ~20 steps as the model learned the JSON action grammar. The plot below is generated directly from the artifact JSON ([`results/sft_warmup_metrics.json`](results/sft_warmup_metrics.json)), not transcribed from a screenshot:

![Final SFT loss curve](results/final_sft_loss_curve.png)

> **Note on training-evidence provenance.** The published `sre_agent_lora_oraclefix` adapter and the `0.915` evaluation are from a 300-step SFT run on Hugging Face Jobs (terminal proof in `docs/proof/final_trained_console.png`, job log [`69eda4d1d2c8bd8662bcf435`](https://huggingface.co/jobs/v4xsh/69eda4d1d2c8bd8662bcf435)). The per-step [`results/sft_warmup_metrics.json`](results/sft_warmup_metrics.json) (40 logged entries / ~160 actual steps at `logging_steps=4`) is from a faithful local re-run with the same script, dataset, and hyperparameters, published so judges can inspect a real `trainer.state.log_history` directly. Both runs converge to the same loss regime (≈0.08–0.10).

### Before vs after, with constrained action scoring

The most important number in this submission isn't the trained score on its own; it's the comparison against the *raw base model* using the same constrained action-scoring approach and base model checkpoint. The final trained run adds the phase-aware workflow controller for hard and cascade tasks, matching the environment's own phase-gated task semantics:

![Raw base model vs final trained policy score](results/before_vs_after_scores.png)

- **Raw `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`, no adapter**: `mean_score = 0.239`, `pass_rate = 0/9`
- **Same base + SFT LoRA `sre_agent_lora_oraclefix`**: `mean_score = 0.915`, `pass_rate = 12/12`

That's a **0.676 absolute improvement** and a **0% → 100% pass rate jump** from the same base checkpoint after loading the SFT LoRA adapter and evaluating with the final phase-aware workflow constraints. Note that the raw baseline covers 9 episodes (easy, medium, and hard × 3 seeds), while the trained run covers 12 because it also includes the cascade task, so the two means are not computed over identical task sets.

### Per-task breakdown

![Final phase-aware trained LoRA score by task](results/final_score_by_task.png)

- **Easy (OOM rank ID)**: 0.99 across all 3 seeds
- **Medium (spine congestion recovery)**: 0.99 across all 3 seeds
- **Hard (staged desync patching)**: 0.85, 0.85, 0.99 across 3 seeds
- **Cascade (3-phase chain)**: 0.782, 0.782, 0.782 across 3 seeds (deterministic)

The raw eval JSON is committed at [`results/final_phaseaware_model_eval.json`](results/final_phaseaware_model_eval.json) so judges can verify these are the actual numbers without rerunning anything.

### Console proof

I'm including the raw HF Jobs / terminal screenshots, because rendered charts are easy to fake but timestamped console output isn't.

**Before**: raw model on HF Jobs, no adapter:

![Raw base model HF Jobs evaluation console](docs/proof/raw_untrained_console.png)

```
[raw_untrained_constrained] task=easy seed=0 score=0.020 steps=8
[raw_untrained_constrained] task=easy seed=1 score=0.020 steps=8
[raw_untrained_constrained] task=easy seed=2 score=0.020 steps=8
[raw_untrained_constrained] task=medium seed=0 score=0.698 steps=8
[raw_untrained_constrained] task=medium seed=1 score=0.666 steps=8
[raw_untrained_constrained] task=medium seed=2 score=0.638 steps=8
[raw_untrained_constrained] task=hard seed=0 score=0.030 steps=8
[raw_untrained_constrained] task=hard seed=1 score=0.030 steps=8
[raw_untrained_constrained] task=hard seed=2 score=0.030 steps=8
{
  "mean_score": 0.23905299382262772,
  "pass_rate": 0.0,
  "n_episodes": 9
}
```

**After**: same base + SFT adapter:

![Final trained policy terminal evaluation](docs/proof/final_trained_console.png)

```
[sft_oraclefix_phaseaware] task=easy seed=0 score=0.990 steps=8
... (12 rows, every one passing) ...
[sft_oraclefix_phaseaware] task=cascade seed=2 score=0.782 steps=8
{
  "mean_score": 0.9146806281246708,
  "pass_rate": 1.0,
  "n_episodes": 12
}
```

Same base checkpoint and constrained action-scoring setup; the final run loads the SFT LoRA adapter and uses the phase-aware controller needed for the long-horizon hard/cascade workflows.

---

## 9. The evaluation methodology (and why I'm being explicit about it)

Honesty matters here, because it's easy to get a high number with the wrong evaluator.

The **phase-aware constrained evaluator** (`scripts/evaluate_model.py`) does this for each step:

1. Build a candidate set of valid next-step JSON actions for the current task phase
2. Score each candidate's log-probability under the model
3. Execute the highest-likelihood candidate
4. Update an internal `PhaseController` so the candidate set advances to the next phase when the grader confirms phase completion

This is the same workflow constraint the environment enforces internally: patches without investigation are blocked, and cascade phases must complete in order. The evaluator just makes those constraints explicit to the model at generation time.

**This is not free-form generation.** I report it that way explicitly in the README's Known Limitations and again here. The trained model also generates valid JSON freely (that's why it scores high under candidate ranking), but if a judge wants free-form generation numbers the cleanest path is to remove the `PhaseController` and rerun (see `scripts/evaluate_model.py:144` for the controller definition). I chose constrained eval because the workflow constraints are *intrinsic* to the task semantics; loosening them changes what we're testing.

---

## 10. Reproduction (one Colab notebook)

Everything in this post is reproducible. The Colab notebook runs three cells:

1. Pull the public LoRA adapter `v4xsh/nervousystem-sre-agent-lora`
2. Hit the public Space at `https://v4xsh-nervousystem-env.hf.space`
3. Run `scripts/evaluate_model.py` with constrained scoring

Expected output: `mean_score ≈ 0.915, pass_rate = 1.0` on `easy,medium,hard,cascade` × 3 seeds.

The same repository also includes a full retraining command (`scripts/run_overnight.sh`) for an A10G-class GPU.

---

## 11. Honest limitations

The submitted adapter is an **SFT warmup policy**, not a fully optimized online RL policy. The GRPO continuation loop in `training/grpo_train.py` is wired to environment reward and works end-to-end; I just don't have a published GRPO run with a saved reward curve. The shipped 0.915 number is from supervised LoRA over environment-generated SRE rollouts, evaluated under the phase-aware constrained scorer.
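
For readers who want a concrete picture of what "evaluated under the phase-aware constrained scorer" means in practice, the per-step selection from Section 9 can be sketched as below. This is an assumption-laden illustration, not the code in `scripts/evaluate_model.py`: `pick_action` and `toy_logprob` are hypothetical names, the toy scorer stands in for summed token log-probabilities from the LoRA policy, and the `rank_id` parameter on `restart_rank` is assumed.

```python
import json


def pick_action(candidates, logprob_fn):
    """Return the candidate whose serialized JSON gets the highest log-probability."""
    return max(candidates, key=lambda action: logprob_fn(json.dumps(action, sort_keys=True)))


def toy_logprob(text):
    """Stand-in scorer; a real evaluator would sum token log-probs from the policy."""
    preferences = {"inspect_flight_recorder": -1.0, "noop": -3.0, "restart_rank": -5.0}
    return next((lp for name, lp in preferences.items() if name in text), -10.0)


# Hypothetical Phase 1 candidate set for an OOM-diagnosis step.
phase_one_candidates = [
    {"action_type": "noop", "parameters": {}},
    {"action_type": "restart_rank", "parameters": {"rank_id": 3}},  # parameter shape assumed
    {"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}},
]

chosen = pick_action(phase_one_candidates, toy_logprob)
print(chosen["action_type"])
```

Because the policy only ranks a pre-built candidate set, it can never emit malformed JSON or an out-of-phase action, which is exactly why these numbers should not be read as free-form generation scores.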

The simulator is deterministic under seed for reproducibility, and it models *production-inspired failure signatures* (PyTorch Flight Recorder-style NCCL traces, real NCCL log formatting, real XID error codes) rather than connecting to a real GPU cluster. That's the right tradeoff for a hackathon environment, but anyone who wants to take this further should swap the simulator for a containerized cluster fault-injection harness (k8s node-shaper, libfabric mux, etc.).

The constrained evaluator is what gives the trained policy its 0.915 number. Free-form generation numbers would be lower (I expect somewhere around 0.6, though I have not measured it), and reporting both side-by-side would be the next thing to add.

---

## 12. What I learned

The most surprising thing about this build wasn't the training. It was how much of the work was getting the **reward structure to be honest**.

Every time I added a reward signal, I'd build an exploit agent that gamed it, watch the exploit agent score 0.7+, and have to redesign. The token-efficiency multiplier was the third iteration. The phase precondition gating was the fourth. The cumulative-token integration into final score (not just MER) was the fifth.

By the time the seven exploit agents all collapsed to roughly 0.35 or below, I trusted the reward enough to train against. That trust step matters more than any specific number: without it, every "improvement" is just learned exploitation.

The second surprise: **partial observability is what makes the environment teach something**. My first version had clean telemetry. The model trained perfectly on it and produced an agent that confidently misidentified ranks the moment I added 30% red herrings. Adding noise didn't break the agent; it forced it to learn to *investigate* before acting, which is the actual SRE skill.

---

## 13. Why this matters beyond the hackathon

Frontier AI systems run on infrastructure that's already too complex for a single human to debug in real time.
Today, when a 4096-GPU job stalls at 3 a.m., a human SRE pages in, scrolls through Flight Recorder dumps, and makes a judgment call. There is no good reason that judgment call has to be human from 2026 onward, but there's no public training environment for the agent that should be making it.

NervousSystem-Env is one such environment. It's not the right one for every failure mode, and it's not the only one that should exist. But it wraps production-inspired PyTorch/NCCL failure semantics, multi-agent delegation, and reward-grade exploit resistance into something an LLM can be trained on with a public training script.

We are not training an agent to play a toy game. We are training an agent to keep AI systems alive when the infrastructure underneath them starts failing.

---

## Links

- **Hosted OpenEnv Space**:
- **Trained LoRA adapter**:
- **Final HF Jobs training log**:
- **Reproduction Colab notebook**:
- **In-repo training evidence**: [`TRAINING_EVIDENCE.md`](TRAINING_EVIDENCE.md)
- **In-repo project README**: [`README.md`](README.md)

---

*Built for the OpenEnv Hackathon (India 2026). Origin: Hornet, my local voice assistant. If your AI also runs on a single GPU and you've spent more time debugging the hardware than the model, this environment is for you.*