# aws-rl-grpo-qwen25coder3b-adapter
A GRPO (Group Relative Policy Optimization) LoRA adapter that continues training from `Sizzing/aws-rl-sft-qwen25coder3b-adapter`. It was trained single-step against live AWS RL environment rewards and evaluated multi-step.
## How to load

```python
from unsloth import FastLanguageModel
from peft import PeftModel

base, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-Coder-3B-Instruct-bnb-4bit",
    max_seq_length=3072,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base, "Sizzing/aws-rl-grpo-qwen25coder3b-adapter")
FastLanguageModel.for_inference(model)
```
## Training recipe
| Knob | Value |
|---|---|
| learning_rate | 1.60e-05 |
| beta (KL coef) | 0.002 |
| num_generations (G) | 8 |
| temperature | 0.99 |
| max_completion_length | 768 |
| per-device batch × grad accum | 2 × 8 |
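The core of the recipe above is GRPO's group-relative credit assignment: each prompt is sampled `num_generations = 8` times, each completion is scored by the environment, and a completion's advantage is its reward standardized within its own group. A minimal sketch with made-up rewards (the exact normalization used in this run is an assumption):

```python
import statistics

# G completions are sampled per prompt; each gets a scalar environment
# reward. The rewards below are illustrative values, not real run data.
G = 8
rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 0.5, 1.0, 0.0]
assert len(rewards) == G

# Group-relative advantage: standardize rewards within the group.
mean = statistics.mean(rewards)
std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
advantages = [(r - mean) / std for r in rewards]

# Completions rewarded above the group mean get positive advantage,
# those below get negative; no learned value network is needed.
print([round(a, 3) for a in advantages])
```

Because advantages are zero-mean within each group, GRPO only reinforces completions relative to their siblings; the `beta` knob then adds a KL penalty toward the reference (SFT) policy on top of this signal.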
## Multi-step eval (reserve + drift)
| Metric | SFT baseline | GRPO | Delta |
|---|---|---|---|
| overall_success_rate | 0.868 | 0.862 | -0.006 |
| overall_reward_mean | 0.883 | 0.877 | -0.006 |
| hints_per_solved | 0.000 | 0.000 | +0.000 |
| recovery_rate | 0.333 | 0.000 | -0.333 |
| drift_repair_rate | 0.222 | 0.222 | +0.000 |
| steps_to_solve | 1.446 | 1.553 | +0.108 |
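The aggregate metrics in the table can be derived from per-episode logs along these lines; the record fields below are illustrative assumptions, not the eval harness's actual schema:

```python
# Hypothetical per-episode records (field names are made up for
# illustration; the real harness may log differently).
episodes = [
    {"solved": True, "steps": 1, "hints": 0},
    {"solved": True, "steps": 2, "hints": 0},
    {"solved": False, "steps": 3, "hints": 1},
]

# overall_success_rate: fraction of episodes solved.
overall_success_rate = sum(e["solved"] for e in episodes) / len(episodes)

# steps_to_solve and hints_per_solved average over solved episodes only.
solved = [e for e in episodes if e["solved"]]
steps_to_solve = sum(e["steps"] for e in solved) / len(solved)
hints_per_solved = sum(e["hints"] for e in solved) / len(solved)

print(overall_success_rate, steps_to_solve, hints_per_solved)
```

Restricting `steps_to_solve` and `hints_per_solved` to solved episodes matters: averaging over failures too would conflate efficiency with success rate.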
Trained 2026-04-25T19:02+00:00 on Tesla T4.
## Model tree

Lineage: Qwen/Qwen2.5-3B → Qwen/Qwen2.5-Coder-3B → Qwen/Qwen2.5-Coder-3B-Instruct → Sizzing/aws-rl-sft-qwen25coder3b-adapter → this adapter.