Qwen3.5-9B General-Agent SFT LoRA Adapter
v0.0.1 is the completed LoRA SFT release for the Prime general-agent reproduction using the full, non-prequantized Qwen/Qwen3.5-9B base checkpoint. It does not include the base weights; load this adapter on top of Qwen/Qwen3.5-9B.
This release intentionally excludes the next QLoRA run that is currently in progress. That run will be documented and tagged separately after its training and eval finish.
Release Summary
| Field | Value |
|---|---|
| Release tag | v0.0.1 |
| Adapter repo | benchflow/benchflow-qwen35-9b |
| Base checkpoint | Qwen/Qwen3.5-9B |
| Base checkpoint form | Full, non-quantized source checkpoint; frozen during LoRA SFT |
| Adapter type | LoRA / PEFT |
| Source completed run | general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z |
| W&B project | general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z |
| HF training artifacts | benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z |
| Published at | 2026-06-24 22:27:07 UTC |
Research Reproduction Scope
The goal of this adapter is to reproduce the SFT-stage lift from Prime Intellect's general-agent work as closely as possible while using a smaller student model that can train on one H100. The stack keeps the Prime-style task and verifier path:
- Source tasks: open-source PrimeIntellect-ai/research-environments/environments/general_agent task corpus.
- Teacher trace generation: general-agent-solver-rlm + Azure GPT-5.4-mini through native Verifiers / vf-eval --save-results artifacts.
- SFT trainer: Prime-RL SFT.
- Student: full, non-quantized Qwen/Qwen3.5-9B loaded in BF16 with LoRA adapters.
- Eval: general-agent-solver-local through native vf-eval --save-results on the same held-in task sets before and after SFT.
Data Recipe
| Field | Value |
|---|---|
| Dataset | benchflow/general-agent-qwen35-9b-azure-gpt54mini-sft |
| Dataset rows | 4414 |
| Original source task count | 4417 |
| Teacher model | Azure GPT-5.4-mini |
| Teacher harness | Prime/Verifiers general-agent-solver-rlm |
| Artifact format | Native vf-eval --save-results trajectories converted to Prime-RL messages + tool_defs SFT rows |
| Excluded source tasks | dog_breeding_t1, skydiving_center_t1, skydiving_center_t2 |
| Exclusion reason | Stable Azure content-filter blocks during teacher trace generation |
| Full teacher sweep artifact | benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-daytona-teacher-full4417-tunnel8-20260624T015706Z |
| Data validation | Prime SFT JSONL validator rejected non-leading system messages and leakage fields before training |
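The data validation row above can be illustrated with a minimal sketch. This is not the Prime SFT validator itself: the function name and the leakage field names are hypothetical, since the card does not list them. It only shows the two checks the table describes, applied to a row with a `messages` list.

```python
from typing import Any

# Hypothetical leakage field names; the actual fields rejected by the
# Prime SFT validator are not listed in this card.
LEAKAGE_FIELDS = {"answer", "reward", "verifier_output"}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return the problems found in one SFT row (an empty list means valid)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages'"]

    problems = []
    # A system message is only allowed as the very first message.
    for index, message in enumerate(messages):
        if message.get("role") == "system" and index != 0:
            problems.append(f"system message at position {index}")

    # Top-level fields that could leak the verifier outcome into training.
    leaked = sorted(LEAKAGE_FIELDS.intersection(row))
    if leaked:
        problems.append(f"leakage fields present: {leaked}")
    return problems
```

A row passes when `validate_row(row)` returns an empty list; in a JSONL pipeline each line would be parsed with `json.loads` and checked before it reaches the trainer.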
Training Parameters
| Field | Value |
|---|---|
| Trainer | Prime-RL SFT |
| Model loaded for SFT | Qwen/Qwen3.5-9B full BF16 base weights |
| Quantization | None for the completed v0.0.1 LoRA run |
| Adapter | LoRA |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | about 29.1M |
| Adapted base params | about 5.30B |
| Total base params loaded | about 9.44B |
| Sequence length | 2048 |
| Global batch size | 8 |
| Micro batch size | 1 |
| Pack function | cat |
| Shuffle | true |
| Seed | 0 |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Betas | 0.9, 0.999 |
| Grad norm clip | 1.0 |
| Scheduler | Linear |
| Warmup steps | 20 |
| Decay steps | 180 |
| Minimum LR | 0.0 |
| Max steps | 200 |
| Checkpoint interval | 20 |
| Keep last | 3 |
| Keep interval | 100 |
| Save format | safetensors |
| Loss mask | Assistant messages only; system, user, and tool messages are context-only |
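Several quantities follow directly from the table. The sketch below derives them and models the linear schedule. The step indexing and the exact warmup shape are assumptions, not taken from the Prime-RL source, and the token budget is an upper bound that assumes every packed sample fills the full sequence length.

```python
# Values copied from the training parameters table.
PEAK_LR = 5e-5
MIN_LR = 0.0
WARMUP_STEPS = 20
DECAY_STEPS = 180
MAX_STEPS = 200
GLOBAL_BATCH = 8
MICRO_BATCH = 1
SEQ_LEN = 2048
LORA_RANK = 16
LORA_ALPHA = 32


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to MIN_LR.

    Assumes 0-indexed steps; Prime-RL's exact indexing may differ by one.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(step - WARMUP_STEPS + 1, DECAY_STEPS) / DECAY_STEPS
    return PEAK_LR + (MIN_LR - PEAK_LR) * progress


# Micro-batches accumulated per optimizer step on a single GPU.
grad_accum_steps = GLOBAL_BATCH // MICRO_BATCH

# Upper bound on tokens seen over the whole run.
max_tokens_seen = MAX_STEPS * GLOBAL_BATCH * SEQ_LEN

# Standard LoRA scaling factor applied to the low-rank update.
lora_scale = LORA_ALPHA / LORA_RANK

if __name__ == "__main__":
    print(grad_accum_steps, max_tokens_seen, lora_scale)
    print([learning_rate(s) for s in (0, 19, 20, 199)])
```

Under these assumptions the warmup and decay steps sum to the 200 max steps, so the schedule reaches the minimum learning rate exactly at the final step.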
Training Result
| Metric | Value |
|---|---|
| Completed step | 200 |
| Final loss | 0.11897 |
| loss/nan_count | 0 |
| Peak GPU memory | about 40.8 GiB |
| Final adapter | adapter_model.safetensors in this repo |
The initial Prime-RL BF16 LoRA attempt with data.seq_len=8192 ran out of memory (OOM) on one H100. The completed v0.0.1 run used data.seq_len=2048, system CUDA 12.8 nvcc/ptxas, and g++-12 for the required FLA/TileLang kernels.
Evaluation Results
All evaluations below use native Verifiers vf-eval --save-results, general-agent-solver-local, serving context length 4096, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml. Dynamic vLLM LoRA loading was not reliable for this stack, so eval served a merged local checkpoint built from this adapter plus Qwen/Qwen3.5-9B.
| Task set | Base pass rate | LoRA SFT pass rate | Delta | Notes |
|---|---|---|---|---|
| Held-in 5 smoke | 1/5 = 20.00% | 2/5 = 40.00% | +20.00 pp | First serving/eval smoke |
| Held-in 20 | 11/20 = 55.00% | 13/20 = 65.00% | +10.00 pp | Recovered 3d_print_shop_t1, accounting_firm_t1 |
| Held-in 36 | 20/36 = 55.56% | 23/36 = 63.89% | +8.33 pp | No regressions; recovered 3d_print_shop_t1, accounting_firm_t1, allergy_clinic_t0 |
| Held-in 50 assembled | 27/50 = 54.00% | 30/50 = 60.00% | +6.00 pp | Latest wider held-in result; final 14-task slice had no net delta |
| Held-in 50 final 14-task slice | 7/14 = 50.00% | 7/14 = 50.00% | +0.00 pp | Recovered animation_studio_t0; regressed antiquarian_bookshop_t0 |
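The deltas in the table are differences in percentage points, and the assembled held-in 50 row is the sum of the held-in 36 row and the final 14-task slice. A small sketch that recomputes both from the pass counts reported above (the helper name `delta_pp` is illustrative):

```python
from fractions import Fraction

# (passed, total) pairs for base and LoRA SFT, copied from the table.
RESULTS = {
    "Held-in 5 smoke": ((1, 5), (2, 5)),
    "Held-in 20": ((11, 20), (13, 20)),
    "Held-in 36": ((20, 36), (23, 36)),
    "Held-in 50 assembled": ((27, 50), (30, 50)),
    "Held-in 50 final 14-task slice": ((7, 14), (7, 14)),
}


def delta_pp(base: tuple[int, int], tuned: tuple[int, int]) -> float:
    """Difference between two pass rates, in percentage points."""
    return float((Fraction(*tuned) - Fraction(*base)) * 100)


def add_counts(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Combine two (passed, total) pairs from disjoint task sets."""
    return (a[0] + b[0], a[1] + b[1])


if __name__ == "__main__":
    for name, (base, tuned) in RESULTS.items():
        print(f"{name}: {delta_pp(base, tuned):+.2f} pp")

    # The assembled 50-task result should equal the 36-task set plus the 14-task slice.
    base36, tuned36 = RESULTS["Held-in 36"]
    base14, tuned14 = RESULTS["Held-in 50 final 14-task slice"]
    assert add_counts(base36, base14) == RESULTS["Held-in 50 assembled"][0]
    assert add_counts(tuned36, tuned14) == RESULTS["Held-in 50 assembled"][1]
```

With task sets this small, a single task moves the held-in 50 pass rate by 2 pp, so the deltas are best read as task counts rather than precise rates.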
Evaluation artifact prefixes:
- Held-in 5 smoke: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-smoke4096-20260624T152150Z
- Held-in 20 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin20-compare-20260624
- Held-in 36 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin36-compare-20260624
- Held-in 50 final 14-task run: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin50-gap-20260624T190517Z
Loading
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)
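The evaluation served a merged local checkpoint rather than loading the adapter dynamically. The card does not say how that checkpoint was built. One common approach, sketched here with PEFT's `merge_and_unload`, folds the LoRA weights into the base model and saves the result. The function name `merge_adapter` and the `output_dir` argument are illustrative, and the BF16 dtype is an assumption chosen to match the training setup.

```python
def merge_adapter(
    output_dir: str,
    base_id: str = "Qwen/Qwen3.5-9B",
    adapter_id: str = "benchflow/benchflow-qwen35-9b",
) -> None:
    """Merge the LoRA adapter into the base weights and save a standalone checkpoint."""
    # Heavy imports are deferred so that defining this function has no side effects.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,  # assumption: same precision as the SFT run
        trust_remote_code=True,
    )
    model = PeftModel.from_pretrained(base, adapter_id)

    # Fold the low-rank updates into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)

    # Save the tokenizer alongside so the directory can be served on its own.
    tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
    tokenizer.save_pretrained(output_dir)
```

Calling `merge_adapter("./qwen35-9b-merged")` (a placeholder path) downloads both repositories and writes a full-size checkpoint, so it needs enough disk and memory for the 9B base model.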
Caveats
- This is an SFT-stage reproduction artifact, not the full Prime paper recipe with the original teacher and student model stack.
- The trainable dataset has 4414 rows rather than 4417 because three Azure teacher prompts were blocked by content filtering.
- The latest held-in 50 assembled lift is positive but modest at +6.00 pp; gains are concentrated in a small number of tasks rather than broad across-the-board recovery.
- The next QLoRA seq8192 experiment is excluded from v0.0.1 and should receive its own update/tag only after it completes.