
NervousSystem-Env: Teaching AI Agents to Save the Machines That Train AI

Hornet taught me that local AI is only powerful when the machine underneath it survives. NervousSystem-Env turns that lesson into an RL environment for GPU-fleet incident response.

This is the story of how a side project on my own GPU became an OpenEnv environment where AI agents learn to act as Site Reliability Engineers for distributed training fleets. It is also the story of how a Qwen2.5-7B LoRA went from a 0.239 raw constrained baseline to 0.915 with a 100% pass rate under the final phase-aware constrained evaluator, just by teaching it to triage GPU failures the way a real on-call engineer would.


1. It started with Hornet

I've been building Hornet, my fully local, voice-driven AI assistant. Hornet runs speech-to-text, speaker verification, local LLM reasoning, and voice automation on-device: no cloud round-trips, everything on my own GPU.

I went into Hornet thinking the hard problem was the model. Faster STT, better voice IDs, smarter local reasoning. I was wrong. The hard problem turned out to be the machine underneath the model.

CUDA jobs OOMed silently. NCCL versions stopped agreeing after a pip upgrade somewhere up the dependency tree. A rogue LD_LIBRARY_PATH decided to load NCCL 2.21.5 over 2.27.0 and message-truncated three different communicators. PyTorch Flight Recorder dumps were correct but useless if you didn't know which pg_status field actually mattered. I'd lose entire afternoons reading logs, restarting collectives, and second-guessing config.

That's when the obvious thought hit: if this is painful for one builder on one GPU, what does it look like for an AI lab running thousands of GPUs?

A frontier-scale training run is essentially a living nervous system. One bad rank, one stale telemetry stream, one topology hop crossing an oversubscribed spine switch, and tens of thousands of dollars of compute disappear into the void every minute. Production AI infrastructure is one of the most expensive unsolved reliability problems in modern computing, and it's almost completely unrepresented in the RL/LLM training literature.

So I built it as a simulator: NervousSystem-Env. An OpenEnv-compliant environment where an LLM agent has to act like an SRE for a distributed GPU training fleet. Diagnose. Triage. Recover. Without breaking anything.


2. What the agent actually sees

The environment exposes a Gym-style API on top of FastAPI: reset, step, state, and grade, plus the multi-agent endpoints delegate, consensus, and coalition, and a set of curriculum endpoints. It runs as a Docker-based Hugging Face Space at https://huggingface.co/spaces/v4xsh/nervousystem-env.

Every episode begins with a ClusterObservation: partial telemetry from an 8-rank training cluster. The agent sees:

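  • Per-node health: health_status (nominal | degraded | failed) and xid_errors, such as the XID 79 codes the simulator attaches to OOM events (on real hardware, NVIDIA documents XID 79 as "GPU has fallen off the bus")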
  • Training metrics: throughput_tokens_per_sec, target_throughput, stalled_steps, job_status (stalled | running | recovered)
  • Surface NCCL logs: lines that look exactly like real PyTorch NCCL output (<hostname>:<pid>:<tid> NCCL INFO ...), including the misleading ones
  • Cumulative tokens: the agent can see how much it has spent thinking, which matters for the reward
  • A stale_telemetry flag: surface logs only refresh every 3 steps unless the agent actively pulls deep diagnostics

Crucially, the environment is partially observable on purpose:

  • 30% of telemetry updates contain red-herring node alerts (e.g. "node 2 elevated retransmit rate") that look real but aren't the actual culprit
  • 20% of updates mask a genuinely degraded node so it appears nominal
  • Deep diagnostics (inspect_flight_recorder, query_nccl_logs) are only available via explicit tool calls; the agent has to choose to look

This is by design. A model that pattern-matches surface logs without investigating gets gamed by the noise.
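The noise model above can be sketched in a few lines. This is a minimal illustration, not the environment's actual code: the function name `noisy_view`, the dictionary layout, and the choice to apply masking and red herrings as independent draws are all assumptions. Only the 30% / 20% rates and the 3-step refresh interval come from the description above.

```python
import random

RED_HERRING_RATE = 0.30   # share of updates carrying a misleading alert
MASK_RATE = 0.20          # share of updates hiding a degraded node
REFRESH_INTERVAL = 3      # surface telemetry refreshes every 3 steps


def noisy_view(true_health, step, rng, last_view=None):
    """Return (view, stale) for one telemetry update (hypothetical helper).

    true_health maps rank_id -> "nominal" | "degraded" | "failed".
    """
    # Between refreshes the agent keeps seeing the previous snapshot.
    if last_view is not None and step % REFRESH_INTERVAL != 0:
        return last_view, True

    view = {"health": dict(true_health), "alerts": []}

    # Masking: one genuinely degraded node is reported as nominal.
    degraded = [r for r, h in true_health.items() if h == "degraded"]
    if degraded and rng.random() < MASK_RATE:
        view["health"][rng.choice(degraded)] = "nominal"

    # Red herring: an alert is raised on a node that is actually healthy.
    healthy = [r for r, h in true_health.items() if h == "nominal"]
    if healthy and rng.random() < RED_HERRING_RATE:
        view["alerts"].append(
            f"node {rng.choice(healthy)} elevated retransmit rate"
        )

    return view, False
```

Passing a seeded `random.Random` instance keeps the sketch reproducible, which mirrors the seed-controlled determinism the environment relies on.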

The agent's action space is small but expressive:

{"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}
{"action_type": "query_nccl_logs", "parameters": {"time_window": 5}}
{"action_type": "topo_reorder", "parameters": {"affinity": "rack"}}
{"action_type": "patch_divergent_code", "parameters": {"file": "model/transformer.py", "fix_type": "synchronize_conditional"}}
{"action_type": "noop", "parameters": {}}

Plus destructive actions like restart_rank and reset_ib_interface that the reward actively punishes when used inappropriately.
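As a rough illustration of how a client might build these payloads, here is a small helper that checks parameter names before serializing. The helper name `make_action` and the validation rules are assumptions for this sketch. The action types and parameter names are copied from the examples above, and the parameters of the destructive actions are not shown in this post, so they are passed through unchecked.

```python
import json

# Parameter names taken from the example actions in this post.
KNOWN_PARAMS = {
    "inspect_flight_recorder": {"rank_id"},
    "query_nccl_logs": {"time_window"},
    "topo_reorder": {"affinity"},
    "patch_divergent_code": {"file", "fix_type"},
    "noop": set(),
}
# Named in the post, but their parameters are not documented here.
DESTRUCTIVE = {"restart_rank", "reset_ib_interface"}


def make_action(action_type, **parameters):
    """Build an action dict, rejecting unknown types or parameter names."""
    if action_type in KNOWN_PARAMS:
        expected = KNOWN_PARAMS[action_type]
        if set(parameters) != expected:
            raise ValueError(
                f"{action_type} expects parameters {sorted(expected)}, "
                f"got {sorted(parameters)}"
            )
    elif action_type not in DESTRUCTIVE:
        raise ValueError(f"unknown action_type: {action_type}")
    return {"action_type": action_type, "parameters": parameters}


if __name__ == "__main__":
    print(json.dumps(make_action("inspect_flight_recorder", rank_id=3)))
```

Validating on the client side is optional; it simply keeps malformed actions from costing a step.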


3. Anatomy of a failure (the cascade task)

The cascade task is the one that captures what makes this environment different from a one-shot toy. It's a 3-phase chained incident that has to be solved in order:

  1. Phase 1: OOM diagnosis. A specific rank is OOMing. The agent must inspect the Flight Recorder for that rank to confirm it.
  2. Phase 2: Spine congestion recovery. Training comes back up but throughput is at 60% of target because the ring topology is crossing oversubscribed spine links. The agent has to call topo_reorder(affinity="rack") to force rack-affine ring placement.
  3. Phase 3: Compilation desync. Underneath everything is a real LD_LIBRARY_PATH bug: different ranks compiled different NCCL collectives because data-dependent branching produced asymmetric code paths. The agent has to investigate via query_nccl_logs, identify the divergent file, propose a diff, then apply synchronize_conditional.

The trick: phases are gated by precondition enforcement. You cannot guess Phase 3's patch first. The grader explicitly rejects patch_divergent_code calls that haven't been preceded by a successful Phase 1 diagnosis and a successful Phase 2 topology fix. There's no shortcut.
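The gating idea can be sketched as a tiny state machine. This is a simplified illustration under stated assumptions: the class name `CascadeGateSketch` and the `succeeded` flag (a stand-in for the grader's verdict) are hypothetical, and Phase 3 is reduced to a single patch call even though the real task also requires investigation and a proposed diff.

```python
# Which cascade phase each gated action belongs to (simplified).
REQUIRED_PHASE = {
    "inspect_flight_recorder": 1,  # Phase 1: OOM diagnosis
    "topo_reorder": 2,             # Phase 2: spine congestion recovery
    "patch_divergent_code": 3,     # Phase 3: desync patch
}


class CascadeGateSketch:
    """Toy precondition gate: later phases are rejected until earlier ones pass."""

    def __init__(self):
        self.phase = 1  # current phase; 4 means all three phases are done

    def submit(self, action_type, succeeded):
        """Return True if the action is accepted, False if it is rejected."""
        needed = REQUIRED_PHASE.get(action_type)
        if needed is None:
            return True   # ungated actions (noop, log queries) always pass
        if needed > self.phase:
            return False  # precondition not met: an earlier phase is unsolved
        if needed == self.phase and succeeded:
            self.phase += 1
        return True
```

In this sketch a premature patch is rejected without changing state, so the agent gains nothing by guessing ahead.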

I committed a 56-event long-horizon trace at results/cascade_long_horizon_trace.json that walks through exactly what a successful cascade run looks like, including the model handling stale telemetry, doing periodic monitoring while waiting for Phase 2 dynamics to settle, and only then escalating to deep desync investigation.


4. The reward function (and why it can't be cheated)

The headline formula is simple:

MER = 0.60 × R_success + 0.30 × R_subgoal − 0.10 × log(total_tokens)
efficiency_multiplier = max(0.01, 1.0 − 0.10 × log(total_tokens))
final_score = raw_task_score × efficiency_multiplier

Where:

  • R_success is binary: did the job recover?
  • R_subgoal is continuous partial credit per phase
  • log(total_tokens) is the verbosity penalty

But the interesting part isn't the formula. It's the design decision to apply efficiency_multiplier as a multiplier on the final grade, not as a separate auxiliary metric. This means a verbose agent that token-stuffs its way to a "successful" trajectory still has its multiplier clamped to the 0.01 floor, which collapses its final score. Token efficiency can't be treated as an afterthought because it is built into the grade.
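To make the arithmetic concrete, here is a direct transcription of the three formulas. One assumption is labeled in the code: `log` is taken to be the natural logarithm. The formulas do not state the base, but the natural log is the reading under which 100,000 tokens drives the multiplier to its floor. The function names are illustrative, not the environment's.

```python
import math


def mer(r_success, r_subgoal, total_tokens):
    """MER as written above; log is ASSUMED to be the natural logarithm."""
    return 0.60 * r_success + 0.30 * r_subgoal - 0.10 * math.log(total_tokens)


def efficiency_multiplier(total_tokens):
    """Token-efficiency multiplier with its 0.01 floor."""
    return max(0.01, 1.0 - 0.10 * math.log(total_tokens))


def final_score(raw_task_score, total_tokens):
    """Final grade: raw task score scaled by the efficiency multiplier."""
    return raw_task_score * efficiency_multiplier(total_tokens)


if __name__ == "__main__":
    for tokens in (50, 1_000, 100_000):
        print(tokens, round(final_score(0.99, tokens), 3))
```

Under this assumption, a raw score of 0.99 at 100,000 tokens is clamped to 0.99 × 0.01 ≈ 0.010, which is consistent with the token-stuffing row in the exploit table.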

To prove the reward isn't gameable, I wrote an exploit suite (scripts/exploit_test.py, scripts/reward_audit.py) that runs seven adversarial agents against the environment:

| Exploit | Final score | Status |
|---|---|---|
| Noop spam (50 steps) | 0.020 | BLOCKED |
| Wrong rank spam (20 steps) | 0.020 | BLOCKED |
| Patch without investigation | 0.030 | BLOCKED |
| Destructive action spam | 0.010 | BLOCKED |
| Token stuffing (100k tokens) | 0.010 | BLOCKED |
| Cascade phase skip | 0.010 | BLOCKED |
| Reward farming (investigation loop) | 0.351 | BLOCKED |

The reward farming case is the most interesting one. An agent that loops inspect_flight_recorder hoping to farm +0.05 per call gets shut down because the cumulative token cost feeds into the efficiency multiplier: the more it loops, the lower its final score becomes.

Audit JSON is committed at results/reward_audit.json and results/exploit_test.json so a judge can verify these numbers without running anything.

The reward also passes four other audits:

  • Monotonicity: correct action > wrong action > noop on per-step reward and on grade
  • Anti-gaming: all four canonical exploit categories blocked
  • Token efficiency: efficient agent grade 0.603 vs token-stuffed agent grade 0.010, delta 0.593
  • Determinism: same seed produces identical scores across reruns

5. Multi-agent: supervisor and four specialist workers

Theme 1 of the hackathon was multi-agent interactions, so the environment has a real multi-agent layer, not a stub.

A Supervisor agent receives the cluster observation and can /delegate work to four specialist workers:

  • LogInspectorWorker: Flight Recorder + NCCL subsystem analysis
  • PatchAgentWorker: staged patching workflow (identify → propose → apply)
  • TopoAgentWorker: topology reorder and bandwidth checks
  • VersionCheckerWorker: NCCL version + LD_LIBRARY_PATH audit

Workers return structured outputs with confidence scores and uncertainty signals. The supervisor can also poll /consensus to get all four workers' votes on the next action, or attempt a /coalition that combines two workers' actions atomically (e.g. topology_version_fix requires both the topology agent and the version checker to agree).
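One plausible way to combine the four workers' votes is a confidence-weighted tally. This post does not say how /consensus aggregates votes, so the weighting rule, the function name `weighted_consensus`, and the tuple layout below are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_consensus(votes):
    """Pick the action with the largest summed confidence.

    votes: list of (worker_name, action_type, confidence in [0, 1]).
    Returns (winning_action_type, share_of_total_confidence), or
    (None, 0.0) when there are no votes or all confidences are zero.
    """
    totals = defaultdict(float)
    for _worker, action_type, confidence in votes:
        totals[action_type] += confidence

    total_confidence = sum(totals.values())
    if total_confidence == 0:
        return None, 0.0

    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_confidence


if __name__ == "__main__":
    example_votes = [
        ("LogInspectorWorker", "inspect_flight_recorder", 0.9),
        ("PatchAgentWorker", "patch_divergent_code", 0.4),
        ("TopoAgentWorker", "inspect_flight_recorder", 0.6),
        ("VersionCheckerWorker", "query_nccl_logs", 0.5),
    ]
    print(weighted_consensus(example_votes))
```

Returning the winner's share of total confidence gives the supervisor a simple signal for when agreement is weak enough to justify another delegation.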

The supervisor has a delegation budget of 10 per episode. Over-budget delegations cost coordination reward, so the agent has to learn when to escalate to a specialist vs handle something itself.

I evaluated this with two reference agents:

| Agent | Mean score | Pass rate |
|---|---|---|
| Coordination oracle (delegates correctly, then remediates) | 0.990 | 100% |
| Direct-action baseline (skips specialist evidence) | 0.100 | 0% |

That's a 9.9× gap purely from coordination behavior. Evidence: results/multi_agent_eval.json.


6. Adaptive curriculum

For Theme 4 (self-improvement), the environment has an adaptive curriculum that escalates difficulty in response to agent performance:

| Level | Trigger | Red herring % | Telemetry mask % | Refresh interval |
|---|---|---|---|---|
| 1 (base) | < 75% pass rate | 30% | 20% | 3 steps |
| 2 (hard) | > 75% sustained | 50% | 35% | 5 steps |
| 3 (hardest) | > 75% at level 2 | 70% | 50% | 7 steps |

Escalation requires pass_rate > 0.75 AND trend > 0.01 over the last 20 episodes, with a 10-episode cooldown to prevent oscillation. Backoff triggers when pass_rate < 0.20. Agents also earn a meta-reward bonus for improving over their own rolling window (not just for solving), which stabilizes early-curriculum learning.
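The escalation rule can be sketched as follows. The thresholds, window size, and cooldown are the ones stated above. Everything else is an assumption for illustration: the class name, the definition of trend (mean of the newer half of the window minus mean of the older half), and the choice not to clear the window when the level changes.

```python
from collections import deque


class CurriculumSketch:
    WINDOW = 20      # episodes used for pass_rate and trend
    COOLDOWN = 10    # minimum episodes between level changes
    MAX_LEVEL = 3

    def __init__(self):
        self.level = 1
        self.results = deque(maxlen=self.WINDOW)
        self.since_change = self.COOLDOWN  # no cooldown pending at start

    def record(self, passed):
        """Record one episode outcome and return the (possibly new) level."""
        self.results.append(1.0 if passed else 0.0)
        self.since_change += 1
        if len(self.results) < self.WINDOW or self.since_change < self.COOLDOWN:
            return self.level

        window = list(self.results)
        half = self.WINDOW // 2
        pass_rate = sum(window) / len(window)
        # ASSUMED trend definition: newer-half mean minus older-half mean.
        trend = sum(window[half:]) / half - sum(window[:half]) / half

        if pass_rate > 0.75 and trend > 0.01 and self.level < self.MAX_LEVEL:
            self.level += 1
            self.since_change = 0
        elif pass_rate < 0.20 and self.level > 1:
            self.level -= 1
            self.since_change = 0
        return self.level
```

Note one consequence of requiring a positive trend: under this definition, an agent that passes every episode from the start has zero trend and does not escalate.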

You can monitor it live: GET /curriculum returns the current state, GET /curriculum/events returns the escalation/backoff log.


7. Training: 800 trajectories, Qwen2.5-7B, TRL + PEFT on HF Jobs

For the actual policy I picked Qwen2.5-7B-Instruct in 4-bit (unsloth/Qwen2.5-7B-Instruct-bnb-4bit) because, among the open-weight models small enough for me to fine-tune, it had the strongest instruction-following and tool-call behavior.

Training stack:

  • Framework: Hugging Face TRL 0.18.2 + PEFT 0.18.0 + bitsandbytes 0.49.2
  • Method: SFT with LoRA rank=16, alpha=16, target modules q/k/v/o/gate/up/down_proj
  • Data: 800 multi-step SRE rollout examples generated by the environment itself. The env produces oracle-quality trajectories under seed control, and those trajectories are then used as supervision
  • Hardware: NVIDIA A10G on Hugging Face Jobs
  • Logged training steps: 40 (per trainer.state.log_history, published at results/sft_warmup_metrics.json)
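For readers who want to mirror the adapter settings, the LoRA hyperparameters listed above can be written as keyword arguments for PEFT's `LoraConfig`. This is a sketch rather than the contents of the training script: the expansion of q/k/v/o/gate/up/down_proj into full module names and the `task_type` value are assumptions, and any setting not listed above (dropout, bias) is left at the library default.

```python
# LoRA settings from the training stack above, as LoraConfig keyword arguments.
LORA_KWARGS = {
    "r": 16,            # LoRA rank
    "lora_alpha": 16,   # scaling factor
    "target_modules": [
        # ASSUMED expansion of "q/k/v/o/gate/up/down_proj"
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # ASSUMED: causal language-model fine-tuning
}

# With PEFT installed, these would be passed as:
#   from peft import LoraConfig
#   lora_config = LoraConfig(**LORA_KWARGS)

if __name__ == "__main__":
    print(LORA_KWARGS)
```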

The training script is in training/grpo_train.py. The file name says GRPO because it also contains a GRPO continuation pipeline that loops back to the environment for online RL, but the submitted adapter is the SFT warmup (see Section 11 on honest limitations).

A single end-to-end command spins up the env, runs SFT, runs GRPO continuation, and runs constrained eval against raw / SFT / SFT+GRPO policies:

bash scripts/run_overnight.sh

I trained on Hugging Face Jobs because (1) it's reproducible by judges and (2) the entire job log is preserved at https://huggingface.co/jobs/v4xsh/69eda4d1d2c8bd8662bcf435.


8. The results

Loss curve

Training loss fell from 2.53 to ~0.10 over the 40 logged SFT steps in the published trainer.state.log_history, with the steepest descent in the first ~20 steps as the model learned the JSON action grammar. The plot below is generated directly from the artifact JSON (results/sft_warmup_metrics.json), not transcribed from a screenshot:

[Figure: Final SFT loss curve]

Note on training-evidence provenance. The published sre_agent_lora_oraclefix adapter and the 0.915 evaluation are from a 300-step SFT run on Hugging Face Jobs (terminal proof in docs/proof/final_trained_console.png, job log 69eda4d1d2c8bd8662bcf435). The per-step results/sft_warmup_metrics.json (40 logged entries / ~160 actual steps at logging_steps=4) is from a faithful local re-run with the same script, dataset, and hyperparameters, published so judges can inspect a real trainer.state.log_history directly. Both runs converge to the same loss regime (≈0.08–0.10).

Before vs after, with constrained action scoring

The most important number in this submission isn't the trained score on its own; it's the comparison against the raw base model using the same constrained action-scoring approach and base model checkpoint. The final trained run adds the phase-aware workflow controller for hard and cascade tasks, matching the environment's own phase-gated task semantics:

[Figure: Raw base model vs final trained policy score]

  • Raw unsloth/Qwen2.5-7B-Instruct-bnb-4bit, no adapter: mean_score = 0.239, pass_rate = 0/9 (easy, medium, hard × 3 seeds)
  • Same base + SFT LoRA sre_agent_lora_oraclefix: mean_score = 0.915, pass_rate = 12/12 (easy, medium, hard, cascade × 3 seeds)

That's a 0.676 absolute improvement and a 0% → 100% pass rate jump from the same base checkpoint after loading the SFT LoRA adapter and evaluating with the final phase-aware workflow constraints.

Per-task breakdown

[Figure: Final phase-aware trained LoRA score by task]

  • Easy (OOM rank ID): 0.99 across all 3 seeds
  • Medium (spine congestion recovery): 0.99 across all 3 seeds
  • Hard (staged desync patching): 0.85, 0.85, 0.99 across 3 seeds
  • Cascade (3-phase chain): 0.782, 0.782, 0.782 across 3 seeds (deterministic)

The raw eval JSON is committed at results/final_phaseaware_model_eval.json so judges can verify these are the actual numbers without rerunning anything.

Console proof

I'm including the raw HF Jobs / terminal screenshots, because rendered charts are easy to fake but timestamped console output isn't.

Before (raw model on HF Jobs, no adapter):

[Figure: Raw base model HF Jobs evaluation console]

[raw_untrained_constrained] task=easy seed=0 score=0.020 steps=8
[raw_untrained_constrained] task=easy seed=1 score=0.020 steps=8
[raw_untrained_constrained] task=easy seed=2 score=0.020 steps=8
[raw_untrained_constrained] task=medium seed=0 score=0.698 steps=8
[raw_untrained_constrained] task=medium seed=1 score=0.666 steps=8
[raw_untrained_constrained] task=medium seed=2 score=0.638 steps=8
[raw_untrained_constrained] task=hard seed=0 score=0.030 steps=8
[raw_untrained_constrained] task=hard seed=1 score=0.030 steps=8
[raw_untrained_constrained] task=hard seed=2 score=0.030 steps=8
{
  "mean_score": 0.23905299382262772,
  "pass_rate": 0.0,
  "n_episodes": 9
}

After (same base + SFT adapter):

[Figure: Final trained policy terminal evaluation]

[sft_oraclefix_phaseaware] task=easy seed=0 score=0.990 steps=8
... (12 rows, every one passing) ...
[sft_oraclefix_phaseaware] task=cascade seed=2 score=0.782 steps=8
{
  "mean_score": 0.9146806281246708,
  "pass_rate": 1.0,
  "n_episodes": 12
}

Same base checkpoint and constrained action-scoring setup; the final run loads the SFT LoRA adapter and uses the phase-aware controller needed for the long-horizon hard/cascade workflows.
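The headline means can be cross-checked from the numbers printed in this post. The snippet below recomputes them from the raw console rows and the per-task breakdown. Because those scores are rounded to two or three decimals, the result only matches the reported mean_score values approximately. The variable and function names are illustrative.

```python
from statistics import mean

# Raw baseline: scores from the "Before" console rows (9 episodes).
raw_scores = {
    "easy": [0.020, 0.020, 0.020],
    "medium": [0.698, 0.666, 0.638],
    "hard": [0.030, 0.030, 0.030],
}

# Trained policy: scores from the per-task breakdown (12 episodes).
trained_scores = {
    "easy": [0.99, 0.99, 0.99],
    "medium": [0.99, 0.99, 0.99],
    "hard": [0.85, 0.85, 0.99],
    "cascade": [0.782, 0.782, 0.782],
}


def summarize(scores_by_task):
    """Return (mean score, number of episodes) across all tasks and seeds."""
    flat = [s for scores in scores_by_task.values() for s in scores]
    return mean(flat), len(flat)


if __name__ == "__main__":
    raw_mean, raw_n = summarize(raw_scores)
    trained_mean, trained_n = summarize(trained_scores)
    print(f"raw: {raw_mean:.3f} over {raw_n} episodes")
    print(f"trained: {trained_mean:.3f} over {trained_n} episodes")
    print(f"absolute improvement: {trained_mean - raw_mean:.3f}")
```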


9. The evaluation methodology (and why I'm being explicit about it)

Honesty matters here, because it's easy to get a high number with the wrong evaluator.

The phase-aware constrained evaluator (scripts/evaluate_model.py) does this for each step:

  1. Build a candidate set of valid next-step JSON actions for the current task phase
  2. Score each candidate's log-probability under the model
  3. Execute the highest-likelihood candidate
  4. Update an internal PhaseController so the candidate set advances to the next phase when the grader confirms phase completion

This is the same workflow constraint the environment enforces internally: patches without investigation are blocked, and cascade phases must complete in order. The evaluator just makes those constraints explicit to the model at generation time.
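The four steps above can be sketched without a model by passing the scoring function in as a parameter. This is an illustration of the candidate-ranking loop, not the contents of scripts/evaluate_model.py: `logprob_fn` stands in for the model's log-probability of an action string, `step_fn` stands in for the environment step plus the grader's phase confirmation, and the class and function names are hypothetical.

```python
import json


def rank_candidates(candidates, logprob_fn):
    """Return the candidate whose JSON string scores highest under logprob_fn."""
    return max(candidates, key=lambda action: logprob_fn(json.dumps(action)))


class PhaseControllerSketch:
    """Holds one candidate list per phase and tracks the current phase."""

    def __init__(self, phases):
        self.phases = phases  # ordered list of candidate-action lists
        self.index = 0

    def candidates(self):
        return self.phases[self.index]

    def advance(self):
        # Stay on the last phase once every phase has been confirmed.
        self.index = min(self.index + 1, len(self.phases) - 1)


def run_episode(controller, logprob_fn, step_fn, max_steps=8):
    """Rank, execute, and advance; returns the chosen action types in order."""
    trace = []
    for _ in range(max_steps):
        action = rank_candidates(controller.candidates(), logprob_fn)
        phase_done = step_fn(action)  # True when the grader confirms the phase
        trace.append(action["action_type"])
        if phase_done:
            controller.advance()
    return trace
```

Because the model only ranks a fixed candidate set, it can never emit malformed JSON under this scheme, which is exactly why these numbers should not be read as free-form generation results.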

This is not free-form generation. I report it that way explicitly in the README's Known Limitations and again here. The trained model also generates valid JSON freely, which is why it scores high under candidate ranking, but if a judge wants free-form generation numbers the cleanest path is to remove the PhaseController and rerun (see scripts/evaluate_model.py:144 for the controller definition). I chose constrained eval because the workflow constraints are intrinsic to the task semantics; loosening them changes what we're testing.


10. Reproduction (one Colab notebook)

Everything in this post is reproducible. The Colab notebook:

https://colab.research.google.com/drive/1twXiRHoAxchy9UUgn7S15a2ac3GrtAlW?usp=sharing

It runs three cells:

  1. Pull the public LoRA adapter v4xsh/nervousystem-sre-agent-lora
  2. Hit the public Space at https://v4xsh-nervousystem-env.hf.space
  3. Run scripts/evaluate_model.py with constrained scoring

Expected output: mean_score ≈ 0.915, pass_rate = 1.0 on easy, medium, hard, cascade × 3 seeds.

The same repository also includes a full retraining command (scripts/run_overnight.sh) for an A10G-class GPU.


11. Honest limitations

The submitted adapter is an SFT warmup policy, not a fully optimized online RL policy. The GRPO continuation loop in training/grpo_train.py is wired to environment reward and works end-to-end; I just don't have a published GRPO run with a saved reward curve. The shipped 0.915 number is from supervised LoRA over environment-generated SRE rollouts, evaluated under the phase-aware constrained scorer.

The simulator is deterministic under seed for reproducibility, and it models production-inspired failure signatures (PyTorch Flight Recorder-style NCCL traces, real NCCL log formatting, real XID error codes) rather than connecting to a real GPU cluster. That's the right tradeoff for a hackathon environment, but anyone who wants to take this further should swap the simulator for a containerized cluster fault-injection harness (k8s node-shaper, libfabric mux, etc.).

The constrained evaluator is what gives the trained policy its 0.915 number. Free-form generation numbers would be lower (I expect somewhere around 0.6), and reporting both side-by-side would be the next thing to add.


12. What I learned

The most surprising thing about this build wasn't the training. It was how much of the work went into making the reward structure honest. Every time I added a reward signal, I'd build an exploit agent that gamed it, watch the exploit agent score 0.7+, and have to redesign. The token-efficiency multiplier was the third iteration. The phase precondition gating was the fourth. The cumulative-token integration into final score (not just MER) was the fifth.

By the time the seven exploit agents all collapsed to 0.351 or below, I trusted the reward enough to train against it. That trust step matters more than any specific number: without it, every "improvement" is just learned exploitation.

The second surprise: partial observability is what makes the environment teach something. My first version had clean telemetry. The model trained perfectly on it and produced an agent that confidently misidentified ranks the moment I added 30% red herrings. Training with that noise didn't break the agent; it forced it to learn to investigate before acting, which is the actual SRE skill.


13. Why this matters beyond the hackathon

Frontier AI systems run on infrastructure that's already too complex for a single human to debug in real time. Today, when a 4096-GPU job stalls at 3 a.m., a human SRE pages in, scrolls through Flight Recorder dumps, and makes a judgment call. There is no good reason that judgment call has to be human from 2026 onward, but there's no public training environment for the agent that should be making it.

NervousSystem-Env is one such environment. It's not the right one for every failure mode, and it's not the only one that should exist. But it wraps production-inspired PyTorch/NCCL failure semantics, multi-agent delegation, and reward-grade exploit resistance into something an LLM can be trained on with a public training script.

We are not training an agent to play a toy game. We are training an agent to keep AI systems alive when the infrastructure underneath them starts failing.


Links


Built for the OpenEnv Hackathon (India 2026). Origin: Hornet, my local voice assistant. If your AI also runs on a single GPU and you've spent more time debugging the hardware than the model, this environment is for you.