# DevOps Pipeline Gym: SFT Adapter
A 1.7B model that learned to investigate before acting on production incidents.
## What this is
A 1.7B on-call agent that scores -0.044 on judgment_call (seed 5003), beating every untrained baseline we tested, from 7B Qwen2.5 up to 671B DeepSeek-V3.1. Under the hood it is a QLoRA adapter on top of `unsloth/Qwen3-1.7B-bnb-4bit`, running inside the DevOps Pipeline Gym OpenEnv environment.
The environment simulates production incident response across 5 microservices in a dependency graph. The agent rotates between three roles (DEV, SRE, OPS). It has to investigate before acting, find the root cause through cascading symptoms, and pick from several valid recovery paths. Every reward component is deterministic Python. There is no LLM judge in the loop and no API calls going anywhere.
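In spirit the interaction is a standard reset/step episode loop against the hosted env. The sketch below is illustrative only: the route names and payload fields are assumptions, not the env's real client API (see the Colab notebook for that).

```python
import requests

ENV_URL = "https://yashash045-devops-pipeline-gym.hf.space"

def run_episode(policy, seed=5003):
    """Drive one incident episode. Routes and fields here are hypothetical."""
    state = requests.post(f"{ENV_URL}/reset", json={"seed": seed}).json()
    total_reward = 0.0
    while not state.get("done", False):
        # policy maps the textual observation (alerts, logs, current role)
        # to an action: investigate, act, or approve
        action = policy(state["observation"])
        state = requests.post(f"{ENV_URL}/step", json={"action": action}).json()
        total_reward += state.get("reward", 0.0)
    return total_reward
```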
## What it learned
Trained on 80 expert trajectories for 2 epochs on a free Kaggle T4 (~30 minutes wall clock). 17.4M trainable parameters (1.69% of base). QLoRA configuration: r=16, alpha=32, dropout=0.05, applied to all attention + MLP modules.
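For reference, those numbers map onto a PEFT config roughly like the sketch below. The `target_modules` list names the standard Qwen3 attention and MLP projections; treat it as an assumption and check the training notebook for the exact set.

```python
from peft import LoraConfig

# Mirrors the stated config: r=16, alpha=32, dropout=0.05,
# applied to all attention + MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
)
```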
The hero result: a 1.7B model trained on 80 trajectories outperforms 7B-671B frontier models on the same judgment_call task. Same task, same seed family, same prompt format, same scoring rubric. Frontier baselines were run through the HF Inference Router (n=3 seeds averaged for frontier models; single-seed for our trained model and the 7B notebook baseline):
| Model | Size | Reward on judgment_call | Δ (ours − baseline) |
|---|---|---|---|
| Llama-3.3-70B-Instruct (untrained) | 70B | -1.815 | +1.771 |
| DeepSeek-V3.1 (untrained) | 671B MoE | -1.580 | +1.536 |
| Mistral-Large-Instruct-2411 (untrained) | 123B | -1.580 | +1.536 |
| Qwen2.5-72B-Instruct (untrained) | 72B | -1.232 | +1.188 |
| GPT-OSS-120B (untrained) | 120B MoE | -1.201 | +1.157 |
| Qwen2.5-7B-Instruct (untrained, baseline in notebook) | 7B | -1.200 | +1.156 |
| Qwen3-1.7B + this SFT adapter (TRAINED) | 1.7B | -0.044 | (baseline) |
A 1.7B model trained on 80 expert trajectories beats every untrained model we tested. That spans a 7B same-family Qwen baseline up to the 671B DeepSeek-V3.1, by +1.16 to +1.77 reward on this task. We did not run untrained Qwen3-1.7B as a same-family baseline within budget. The 7B Qwen2.5 row is the closest-size untrained model the demo notebook actually invokes via HF Router.
Frontier models default to either immediate abort (DeepSeek and Mistral both return -1.580 across all tasks) or attempted-but-failed action sequences. None of them succeed at the task without env-specific training. The trained 1.7B knows to investigate first, find the root cause, deploy carefully, and approve only when the system is healthy.
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the 4-bit base model, then attach the adapter from this repo
base = "unsloth/Qwen3-1.7B-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "yashash045/devops-pipeline-gym-sft-adapter", subfolder="final")

# Then drive it through the env at https://yashash045-devops-pipeline-gym.hf.space
# See devops_pipeline_gym_colab.ipynb for an end-to-end runnable example.
```
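A quick smoke test once the adapter is loaded, independent of the env: format a single observation with the chat template and decode one greedy action. The system and user strings below are illustrative, not the exact prompt format used in training.

```python
messages = [
    {"role": "system", "content": "You are the on-call engineer. Investigate before acting."},
    {"role": "user", "content": "ALERT: checkout-service error rate spiking. What is your next step?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```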
## Reproduce on a free Kaggle T4 (~15 min)
The full eval pipeline runs on a free Kaggle T4. See `scripts/kaggle_cell_acc1.py` in the code repo. It's a single-cell paste that boots the env, downloads this adapter, runs multi-seed eval, and uploads results back to this repo.
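The multi-seed averaging itself is only a few lines. A minimal sketch, reusing the hypothetical `run_episode` helper from above; seed values beyond 5003 are placeholders, and the real single-cell version lives in `scripts/kaggle_cell_acc1.py`.

```python
# Sketch of the multi-seed eval behind the table above.
seeds = [5003, 5004, 5005]  # placeholder seed family; 5003 is the reported one
rewards = [run_episode(policy, seed=s) for s in seeds]
print(f"judgment_call mean reward over {len(seeds)} seeds: "
      f"{sum(rewards) / len(rewards):.3f}")
```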
## Training stack
- Framework: TRL (`SFTTrainer`) + PEFT (QLoRA) + bitsandbytes (4-bit NF4); see the sketch after this list
- Trajectories: 80 expert demonstrations (chat-template format)
- Hardware: Kaggle T4 16GB, free tier
- Cost: $0
- Wall time: ~30 minutes
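Put together, the training call looks roughly like the sketch below, assuming the LoRA config shown earlier is already applied to the 4-bit base. Batch size, accumulation, and learning rate are assumptions chosen to fit a 16GB T4, not values confirmed by the run.

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,                  # 4-bit base with the LoRA adapter attached
    train_dataset=trajectories,   # the 80 chat-template expert demonstrations
    args=SFTConfig(
        num_train_epochs=2,             # matches the stated 2 epochs
        per_device_train_batch_size=1,  # assumption: sized for a 16GB T4
        gradient_accumulation_steps=4,  # assumption
        learning_rate=2e-4,             # assumption: common QLoRA default
        output_dir="sft-adapter",
    ),
)
trainer.train()
```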
## What's the larger work
This adapter is the trained policy for the DevOps Pipeline Gym, an OpenEnv environment built for the Meta PyTorch OpenEnv Hackathon (India 2026). The full submission includes:
- The env (deterministic, no-LLM-judge, role-rotated single policy)
- This SFT adapter (the trained policy)
- A GRPO refinement adapter (RL pipeline proof)
- Frontier baseline comparisons (5 models tested)
- Interactive Colab demo + Gradio "play as the on-call engineer" interface
- A narrative writeup (BLOG.md)
## Citations
```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
(The TRL citation above credits the open-source library used to train this adapter: Hugging Face's `trl` package by von Werra et al.)