# aws-rl-grpo-qwen25coder3b-adapter

A GRPO (Group Relative Policy Optimization) LoRA adapter, continuing training from Sizzing/aws-rl-sft-qwen25coder3b-adapter. Trained single-step against live AWS RL environment rewards; evaluated multi-step.

## How to load

```python
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model, then attach the GRPO LoRA adapter on top.
base, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",
    max_seq_length=3072,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base, "Sizzing/aws-rl-grpo-qwen25coder3b-adapter")
FastLanguageModel.for_inference(model)  # switch adapters/kernels to inference mode
```

## Training recipe

| Knob | Value |
|---|---|
| learning_rate | 1.60e-05 |
| beta (KL coef) | 0.002 |
| num_generations (G) | 8 |
| temperature | 0.99 |
| max_completion_length | 768 |
| per-device batch | 2 × 8 accum |
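For intuition on the `num_generations (G)` knob: GRPO scores each of the G completions sampled for a prompt against the group's own mean and standard deviation, so no separate value model is needed. A minimal sketch of that group-relative advantage (a simplified illustration with made-up reward values, not the actual training code):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 8 completions for one prompt, with illustrative environment rewards
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.5, 1.0]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])
```

Completions that beat the group mean get a positive advantage and are reinforced; the `beta` knob above then scales a KL penalty that keeps the policy close to the SFT reference.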

## Multi-step eval (reserve + drift)

| Metric | SFT baseline | GRPO | Delta |
|---|---|---|---|
| overall_success_rate | 0.868 | 0.862 | -0.006 |
| overall_reward_mean | 0.883 | 0.877 | -0.006 |
| hints_per_solved | 0.000 | 0.000 | +0.000 |
| recovery_rate | 0.333 | 0.000 | -0.333 |
| drift_repair_rate | 0.222 | 0.222 | +0.000 |
| steps_to_solve | 1.446 | 1.553 | +0.108 |
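To make the sign convention explicit: Delta is the GRPO value minus the SFT baseline, so negative deltas on success/reward metrics mean a small regression versus SFT. A quick check for a few of the metrics (values copied from the table above):

```python
# Metric values from the eval table (SFT baseline vs. GRPO adapter)
sft  = {"overall_success_rate": 0.868, "overall_reward_mean": 0.883,
        "recovery_rate": 0.333}
grpo = {"overall_success_rate": 0.862, "overall_reward_mean": 0.877,
        "recovery_rate": 0.000}

# Delta = GRPO minus SFT baseline
deltas = {k: round(grpo[k] - sft[k], 3) for k in sft}
print(deltas)
```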

Trained 2026-04-25T19:02+00:00 on a Tesla T4.
