πŸ›οΈ Sovereign-GRPO-V1 β€” Multi-Objective Self-Healing GRPO

Training Recipe

Four reward signals with decoupled weighting, plus an autonomous plateau/collapse recovery callback:

| Weight | Signal | Correct | Hallucination | Refusal |
|--------|--------|---------|---------------|---------|
| 0.40 | R₁ Correctness | +1.5 | -1.0 | None |
| 0.25 | Rβ‚‚ Logic Density | 0β†’1.0 | -0.3 | -0.5 |
| 0.30 | R₃ Refusal Calibration | +1.0 | -2.5 | 0.0 |
| 0.05 | Rβ‚„ Length Penalty | 0.0 | 0.0 | 0.0 |

Combined reward per outcome: Correct = +1.07, Hallucination = -1.23, Refusal = -0.25
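As a sanity check on the decoupled weighting, a minimal sketch of the weighted sum (weights and per-outcome values taken from the table above; the function and variable names are illustrative, not from the released code):

```python
# Decoupled weights for the four reward signals, as listed in the table.
WEIGHTS = {
    "correctness": 0.40,
    "logic_density": 0.25,
    "refusal_calibration": 0.30,
    "length_penalty": 0.05,
}

def combined_reward(signals: dict) -> float:
    """Weighted sum of per-signal rewards; any signal not supplied contributes 0."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

# Hallucination outcome: each signal's value from the "Hallucination" column.
hallucination = combined_reward({
    "correctness": -1.0,
    "logic_density": -0.3,
    "refusal_calibration": -2.5,
    "length_penalty": 0.0,
})
print(hallucination)  # approximately -1.225, i.e. the -1.23 stated above
```

The weights sum to 1.00, so each combined value stays on the same scale as the individual signals.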

Self-Healing Callback

- Plateau (reward Ξ” < 0.01 over 15 steps): Ξ² Γ— 1.5, LR Γ— 0.5
- Collapse (reward < -1.5 for 8 consecutive steps): save checkpoint and stop
- Entropy death (policy entropy < 0.5): warning logged
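The rules above can be sketched as a standalone callback. This is a hypothetical reconstruction, not the released `sovereign_grpo_train.py` code; the plateau rule is read here as "max-minus-min reward spread under 0.01 across the last 15 steps", which is one plausible interpretation of "reward Ξ” < 0.01":

```python
from collections import deque

class SelfHealingCallback:
    """Hypothetical sketch of the self-healing rules; not the released training code."""
    PLATEAU_STEPS, PLATEAU_DELTA = 15, 0.01
    COLLAPSE_STEPS, COLLAPSE_REWARD = 8, -1.5
    ENTROPY_FLOOR = 0.5

    def __init__(self):
        self.window = deque(maxlen=self.PLATEAU_STEPS)
        self.actions = []  # log of triggered interventions

    def on_step(self, reward: float, entropy: float, state: dict) -> None:
        self.window.append(reward)
        recent = list(self.window)
        # Plateau: reward spread < 0.01 over the last 15 steps -> beta x1.5, LR x0.5.
        if len(recent) == self.PLATEAU_STEPS and max(recent) - min(recent) < self.PLATEAU_DELTA:
            state["beta"] *= 1.5
            state["lr"] *= 0.5
            self.actions.append("plateau")
            self.window.clear()  # restart the observation window after intervening
        # Collapse: reward below -1.5 for 8 consecutive steps -> checkpoint + stop.
        if len(recent) >= self.COLLAPSE_STEPS and all(
                r < self.COLLAPSE_REWARD for r in recent[-self.COLLAPSE_STEPS:]):
            state["stop"] = True  # caller is expected to checkpoint and halt
            self.actions.append("collapse")
        # Entropy death: policy entropy under 0.5 -> warning only, no intervention.
        if entropy < self.ENTROPY_FLOOR:
            self.actions.append("entropy_warning")
```

A usage sketch: feed `on_step(reward, entropy, state)` once per training step with a mutable `state` dict holding `beta`, `lr`, and `stop`; the trainer reads those fields back after each step.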

Launch

```shell
pip install trl transformers torch datasets accelerate trackio
MODEL_ID=Qwen/Qwen2.5-3B-Instruct accelerate launch sovereign_grpo_train.py
```
Base model: Qwen/Qwen2.5-3B