# aws-rl-grpo-qwen25coder3b-adapter

A GRPO (Group Relative Policy Optimization) LoRA adapter, continuing training from Sizzing/aws-rl-sft-qwen25coder3b-adapter. Trained single-step against live AWS RL environment rewards; evaluated multi-step.

## How to load

```python
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model, then attach the GRPO LoRA adapter on top.
base, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",
    max_seq_length=3072,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base, "Sizzing/aws-rl-grpo-qwen25coder3b-adapter")
FastLanguageModel.for_inference(model)  # switch adapters/kernels to inference mode
```

## Training recipe

| Knob | Value |
|---|---|
| learning_rate | 1.60e-05 |
| beta (KL coef) | 0.002 |
| num_generations (G) | 8 |
| temperature | 0.99 |
| max_completion_length | 768 |
| per-device batch | 2 × 8 accum |
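For intuition on the `num_generations (G)` knob: GRPO scores each of the G completions sampled for a prompt against the group's own mean and standard deviation, so no separate value model is needed. A minimal sketch of that group-relative advantage (a simplified illustration with made-up reward values, not the actual training code):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 8 completions for one prompt, with illustrative environment rewards
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.5, 1.0]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])
```

Completions that beat the group mean get a positive advantage and are reinforced; the `beta` knob above then scales a KL penalty that keeps the policy close to the SFT reference.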

## Multi-step eval (reserve + drift)

| Metric | SFT baseline | GRPO | Delta |
|---|---|---|---|
| overall_success_rate | 0.868 | 0.862 | -0.006 |
| overall_reward_mean | 0.883 | 0.877 | -0.006 |
| hints_per_solved | 0.000 | 0.000 | +0.000 |
| recovery_rate | 0.333 | 0.000 | -0.333 |
| drift_repair_rate | 0.222 | 0.222 | +0.000 |
| steps_to_solve | 1.446 | 1.553 | +0.108 |
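To make the sign convention explicit: Delta is the GRPO value minus the SFT baseline, so negative deltas on success/reward metrics mean a small regression versus SFT. A quick check for a few of the metrics (values copied from the table above):

```python
# Metric values from the eval table (SFT baseline vs. GRPO adapter)
sft  = {"overall_success_rate": 0.868, "overall_reward_mean": 0.883,
        "recovery_rate": 0.333}
grpo = {"overall_success_rate": 0.862, "overall_reward_mean": 0.877,
        "recovery_rate": 0.000}

# Delta = GRPO minus SFT baseline
deltas = {k: round(grpo[k] - sft[k], 3) for k in sft}
print(deltas)
```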

Trained 2026-04-25T19:02+00:00 on a Tesla T4.
