omar81939/rl4rlm-sft
Text Generation
LoRA adapters for Qwen3-1.7B from training RLMs via RL: SFT, STaR, DPO, and GRPO-v4 variants. Code: github.com/pythonomar22/rl4rlm
Note SFT: trained on 87 self-bootstrapped trajectories (76.8% avg)
Note STaR: iterative SFT on 132 trajectories (76.3% avg)
Note Best model (84.5% avg); +29.5 pp on multi-needle over STaR
Note GRPO-v4: fixed log-probs + token-level KL (83.4% avg)
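The GRPO-v4 note mentions fixed (reference) log-probs plus a token-level KL penalty. A minimal sketch of how such a per-token KL term is commonly computed in GRPO-style trainers, assuming the unbiased k3 estimator; the function name and inputs are illustrative, not taken from this repo:

```python
import math

def token_kl(policy_logprobs, ref_logprobs):
    """Per-token KL penalty using the k3 estimator,
    exp(ref - pol) - (ref - pol) - 1, which is non-negative
    and estimates KL(policy || ref) at each token."""
    kls = []
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        d = lr - lp  # log-prob gap at this token
        kls.append(math.exp(d) - d - 1.0)
    return kls

# Identical log-probs give zero penalty at every token.
print(token_kl([-1.2, -0.5], [-1.2, -0.5]))  # → [0.0, 0.0]
```

"Token-level" here means the penalty is added per token rather than summed over the sequence first, which keeps the KL gradient aligned with individual token log-probs.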