nervousystem-sre-agent-lora

LoRA adapter for unsloth/Qwen2.5-7B-Instruct-bnb-4bit, fine-tuned to act as a Site Reliability Engineer (SRE) agent for distributed GPU training fleets inside the NervousSystem-Env OpenEnv environment.

TL;DR

  • Base model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
  • Adapter: LoRA via PEFT (r=16, alpha=16, target q/k/v/o/gate/up/down_proj)
  • Training method: Supervised fine-tuning (SFT) using Hugging Face TRL
  • Training data: 800 multi-step SRE rollout trajectories generated by the NervousSystem-Env OpenEnv environment under deterministic seeds
  • Logged training steps: 40 (per trainer.state.log_history)
  • Hardware: NVIDIA A10G on Hugging Face Jobs
  • Loss: 2.53 → ~0.10 (real per-step values published at results/sft_warmup_metrics.json in the env repo)
  • Final evaluation: 0.915 mean score, 100% pass rate over 12 phase-aware constrained episodes (easy, medium, hard, cascade × 3 seeds each)

This adapter is the SFT warmup policy described in the NervousSystem-Env submission. The same training repository also includes an optional GRPO continuation pipeline that loops back to environment rewards, but the published adapter weights here are the SFT result.

Intended Use

This adapter is built to take partial cluster telemetry from the NervousSystem-Env OpenEnv environment and emit a single valid JSON SRE remediation action per step, for example:

{"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}
{"action_type": "topo_reorder", "parameters": {"affinity": "rack"}}
{"action_type": "patch_divergent_code", "parameters": {"file": "model/transformer.py", "fix_type": "synchronize_conditional"}}

It is intended for research/educational use inside the linked environment. It is not a general-purpose chat assistant and should not be used as one.

Out-of-Scope Use

  • Production cluster operations or any safety-critical environment.
  • General conversation, code generation, or unrelated tool use.
  • Any setting where wrong actions could destabilize real hardware. The training distribution is a simulator.

Training Data

The training data consists of 800 oracle-style multi-step trajectories generated by NervousSystem-Env across the easy, medium, hard, and cascade tasks under deterministic seeds. Trajectories include the partial cluster observation, the chosen JSON action, and the resulting environment transitions. No external/private data is used.
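As a sketch of the shape this data takes for SFT, one trajectory step can be flattened into a chat-format example roughly like the following (field names and the system prompt here are illustrative assumptions, not the repo's actual schema):

```python
import json

# Illustrative only: the real record schema and prompts live in the
# NervousSystem-Env repository.
def step_to_messages(observation: dict, action: dict,
                     system_prompt: str) -> list[dict]:
    """Flatten one (observation, action) step into chat-format messages."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(observation)},
        {"role": "assistant", "content": json.dumps(action)},
    ]
```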

Training Procedure

  • Framework: Hugging Face TRL (0.18.2) + PEFT (0.18.0) + bitsandbytes
  • Optimizer: AdamW
  • Precision: 4-bit base + LoRA adapter
  • Sequence length: 2048
  • Logged steps: 40 (from trainer.state.log_history, published as results/sft_warmup_metrics.json)
  • Hardware: NVIDIA A10G (Hugging Face Jobs)

The full training script is published in the environment repository under training/grpo_train.py. The exact Hugging Face Jobs invocation used to produce this adapter is documented in the NervousSystem-Env README.
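Under those settings, the PEFT side of the setup looks roughly like this hedged sketch, reconstructed from the hyperparameters listed above (the authoritative version is training/grpo_train.py; parameter names follow TRL 0.18 / PEFT 0.18, and anything not stated on this card is a placeholder or left at defaults):

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA hyperparameters stated on this card: r=16, alpha=16, seven target
# projection modules.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# SFT settings from the card; output_dir, batch size, and learning rate are
# not stated there, so they are placeholders / TRL defaults here.
sft_cfg = SFTConfig(
    max_length=2048,
    output_dir="sft-warmup",
)

# SFTTrainer(model=..., args=sft_cfg, train_dataset=..., peft_config=lora_cfg)
```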

Evaluation

Evaluation uses the phase-aware constrained action scoring evaluator in scripts/evaluate_model.py from the environment repo. For each step, the model ranks valid next-step JSON actions for the current task phase, and the environment executes the highest-likelihood action.
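The constrained step can be sketched as follows, where `loglik` stands in for a model-likelihood scorer (a hypothetical callable; the real implementation is scripts/evaluate_model.py):

```python
import json
from typing import Callable

def select_action(prompt: str, candidates: list[dict],
                  loglik: Callable[[str, str], float]) -> dict:
    """Pick the phase-valid candidate action the model scores highest.

    `loglik(prompt, action_json)` should return the model's log-likelihood
    of emitting `action_json` after `prompt`; the environment then executes
    the winning action.
    """
    scored = [(loglik(prompt, json.dumps(cand)), cand) for cand in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```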

Metric                                     Value
Mean score                                 0.9147
Pass rate                                  100%
Episodes                                   12 (easy, medium, hard, cascade × 3 seeds)
Raw base model (same constrained scoring)  0.239 mean / 0% pass

Per-task scores:

Task     Scores by seed
easy     0.99 / 0.99 / 0.99
medium   0.99 / 0.99 / 0.99
hard     0.85 / 0.85 / 0.99
cascade  0.782 / 0.782 / 0.782

Full eval JSON: results/final_phaseaware_model_eval.json in the environment repo.

How to Use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
adapter_id = "v4xsh/nervousystem-sre-agent-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
# The base checkpoint is pre-quantized with bitsandbytes, so bitsandbytes and
# accelerate must be installed; device_map="auto" places layers automatically.
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

Then send NervousSystem-Env observations as a chat-formatted prompt and parse the JSON action from the model's reply. The full eval script is scripts/evaluate_model.py in the environment repo.

Limitations and Honest Disclosure

  • This adapter is an SFT warmup policy, not a fully optimized online RL policy. The GRPO continuation loop in training/grpo_train.py works end-to-end on environment reward but is not the source of the published weights here.
  • The reported 0.915 score uses phase-aware constrained action scoring, which restricts candidate actions to the current task phase. Free-form generation numbers would be lower; this is documented in the environment repo's README and Blog.md.
  • The simulator is deterministic under seed and models production-inspired failure signatures, not a real GPU cluster.

License

Apache 2.0 for the adapter weights. The base model unsloth/Qwen2.5-7B-Instruct-bnb-4bit is governed by its own license; please consult the upstream model card before redistribution.
