nervousystem-sre-agent-lora

LoRA adapter for unsloth/Qwen2.5-7B-Instruct-bnb-4bit, fine-tuned to act as a Site Reliability Engineer (SRE) agent for distributed GPU training fleets inside the NervousSystem-Env OpenEnv environment.

TL;DR

  • Base model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
  • Adapter: LoRA via PEFT (r=16, alpha=16, target q/k/v/o/gate/up/down_proj)
  • Training method: Supervised fine-tuning (SFT) using Hugging Face TRL
  • Training data: 800 multi-step SRE rollout trajectories generated by the NervousSystem-Env OpenEnv environment under deterministic seeds
  • Logged training steps: 40 (per trainer.state.log_history)
  • Hardware: NVIDIA A10G on Hugging Face Jobs
  • Loss: 2.53 → ~0.10 (real per-step values published at results/sft_warmup_metrics.json in the env repo)
  • Final evaluation: 0.915 mean score, 100% pass rate over 12 phase-aware constrained episodes (easy, medium, hard, cascade × 3 seeds each)

This adapter is the SFT warmup policy described in the NervousSystem-Env submission. The same training repository also includes an optional GRPO continuation pipeline that loops back to environment rewards, but the published adapter weights here are the SFT result.

Intended Use

This adapter is built to take partial cluster telemetry from the NervousSystem-Env OpenEnv environment and emit a single valid JSON SRE remediation action per step, for example:

{"action_type": "inspect_flight_recorder", "parameters": {"rank_id": 3}}
{"action_type": "topo_reorder", "parameters": {"affinity": "rack"}}
{"action_type": "patch_divergent_code", "parameters": {"file": "model/transformer.py", "fix_type": "synchronize_conditional"}}

It is intended for research/educational use inside the linked environment. It is not a general-purpose chat assistant and should not be used as one.

Out-of-Scope Use

  • Production cluster operations or any safety-critical environment.
  • General conversation, code generation, or unrelated tool use.
  • Any setting where wrong actions could destabilize real hardware. The training distribution is a simulator.

Training Data

The training data consists of 800 oracle-style multi-step trajectories generated by NervousSystem-Env across the easy, medium, hard, and cascade tasks under deterministic seeds. Trajectories include the partial cluster observation, the chosen JSON action, and the resulting environment transitions. No external/private data is used.
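As a sketch of the shape this data takes for SFT, one trajectory step can be flattened into a chat-format example roughly like the following (field names and the system prompt here are illustrative assumptions, not the repo's actual schema):

```python
import json

# Illustrative only: the real record schema and prompts live in the
# NervousSystem-Env repository.
def step_to_messages(observation: dict, action: dict,
                     system_prompt: str) -> list[dict]:
    """Flatten one (observation, action) step into chat-format messages."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(observation)},
        {"role": "assistant", "content": json.dumps(action)},
    ]
```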

Training Procedure

  • Framework: Hugging Face TRL (0.18.2) + PEFT (0.18.0) + bitsandbytes
  • Optimizer: AdamW
  • Precision: 4-bit base + LoRA adapter
  • Sequence length: 2048
  • Logged steps: 40 (from trainer.state.log_history, published as results/sft_warmup_metrics.json)
  • Hardware: NVIDIA A10G (Hugging Face Jobs)

The full training script is published in the environment repository under training/grpo_train.py. The exact Hugging Face Jobs invocation used to produce this adapter is documented in the NervousSystem-Env README.
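Under those settings, the PEFT side of the setup looks roughly like this hedged sketch, reconstructed from the hyperparameters listed above (the authoritative version is training/grpo_train.py; parameter names follow TRL 0.18 / PEFT 0.18, and anything not stated on this card is a placeholder or left at defaults):

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA hyperparameters stated on this card: r=16, alpha=16, seven target
# projection modules.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# SFT settings from the card; output_dir, batch size, and learning rate are
# not stated there, so they are placeholders / TRL defaults here.
sft_cfg = SFTConfig(
    max_length=2048,
    output_dir="sft-warmup",
)

# SFTTrainer(model=..., args=sft_cfg, train_dataset=..., peft_config=lora_cfg)
```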

Evaluation

Evaluation uses the phase-aware constrained action scoring evaluator in scripts/evaluate_model.py from the environment repo. For each step, the model ranks valid next-step JSON actions for the current task phase, and the environment executes the highest-likelihood action.
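The constrained step can be sketched as follows, where `loglik` stands in for a model-likelihood scorer (a hypothetical callable; the real implementation is scripts/evaluate_model.py):

```python
import json
from typing import Callable

def select_action(prompt: str, candidates: list[dict],
                  loglik: Callable[[str, str], float]) -> dict:
    """Pick the phase-valid candidate action the model scores highest.

    `loglik(prompt, action_json)` should return the model's log-likelihood
    of emitting `action_json` after `prompt`; the environment then executes
    the winning action.
    """
    scored = [(loglik(prompt, json.dumps(cand)), cand) for cand in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```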

Metric                                     Value
Mean score                                 0.9147
Pass rate                                  100%
Episodes                                   12 (easy, medium, hard, cascade × 3 seeds)
Raw base model (same constrained scoring)  0.239 mean / 0% pass

Per-task scores:

Task     Scores by seed
easy     0.99 / 0.99 / 0.99
medium   0.99 / 0.99 / 0.99
hard     0.85 / 0.85 / 0.99
cascade  0.782 / 0.782 / 0.782

Full eval JSON: results/final_phaseaware_model_eval.json in the environment repo.

How to Use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
adapter_id = "v4xsh/nervousystem-sre-agent-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
# The base checkpoint is pre-quantized with bitsandbytes, so bitsandbytes and
# accelerate must be installed; device_map="auto" places layers automatically.
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

Then send NervousSystem-Env observations as a chat-formatted prompt and parse the JSON action from the model's reply. The full eval script is scripts/evaluate_model.py in the environment repo.

Limitations and Honest Disclosure

  • This adapter is an SFT warmup policy, not a fully optimized online RL policy. The GRPO continuation loop in training/grpo_train.py works end-to-end on environment reward but is not the source of the published weights here.
  • The reported 0.915 score uses phase-aware constrained action scoring, which restricts candidate actions to the current task phase. Free-form generation numbers would be lower; this is documented in the environment repo's README and Blog.md.
  • The simulator is deterministic under seed and models production-inspired failure signatures, not a real GPU cluster.

License

Apache 2.0 for the adapter weights. The base model unsloth/Qwen2.5-7B-Instruct-bnb-4bit is governed by its own license; please consult the upstream model card before redistribution.
