# 🏛️ Sovereign-GRPO-V1 — Multi-Objective Self-Healing GRPO Training Recipe
Four reward signals with decoupled weighting, plus autonomous plateau/collapse recovery:
| Weight | Reward | Correct | Hallucination | Refusal |
|---|---|---|---|---|
| 0.40 | R₁ Correctness | +1.5 | -1.0 | n/a |
| 0.25 | R₂ Logic Density | 0→1.0 | -0.3 | -0.5 |
| 0.30 | R₃ Refusal Calibration | +1.0 | -2.5 | 0.0 |
| 0.05 | R₄ Length Penalty | 0.0 | 0.0 | 0.0 |
Combined weighted reward per outcome (taking R₂ ≈ 0.7 for a typical correct answer): Correct ≈ +1.07, Hallucination ≈ -1.23, Refusal ≈ -0.13
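The table collapses into one scalar per sampled completion. A minimal sketch, assuming per-sample outcome labels and treating R₂'s logic-density score on correct answers as a 0–1 input; the names (`combined_reward`, `WEIGHTS`) and the 0.7 default are illustrative, not the recipe's actual code:

```python
# Decoupled weights for the four reward signals (from the table above).
WEIGHTS = {"R1": 0.40, "R2": 0.25, "R3": 0.30, "R4": 0.05}

def combined_reward(outcome: str, logic_density: float = 0.7) -> float:
    """Weighted sum of the per-objective rewards for one completion.

    `outcome` is one of "correct", "hallucination", "refusal"; on a
    correct answer R2 is a 0-1 logic-density score rather than a constant.
    """
    table = {
        "correct":       {"R1": 1.5,  "R2": logic_density, "R3": 1.0,  "R4": 0.0},
        # R1's refusal reward is "n/a" in the table; modeled here as 0.0.
        "hallucination": {"R1": -1.0, "R2": -0.3,          "R3": -2.5, "R4": 0.0},
        "refusal":       {"R1": 0.0,  "R2": -0.5,          "R3": 0.0,  "R4": 0.0},
    }
    return sum(WEIGHTS[k] * v for k, v in table[outcome].items())
```

With these numbers, `combined_reward("hallucination")` is -1.225 and `combined_reward("correct")` is 1.075, matching the combined totals quoted above.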
## Self-Healing Callback
- Plateau (reward Δ < 0.01 for 15 steps): β × 1.5, LR × 0.5
- Collapse (reward < -1.5 for 8 steps): checkpoint + stop
- Entropy death (< 0.5): warning logged
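The three rules above can be sketched framework-agnostically. The class below and its per-step `step()` hook are illustrative assumptions (in TRL you would wire this logic into a `TrainerCallback`); only the thresholds come from the recipe:

```python
class SelfHealing:
    """Sketch of the plateau/collapse/entropy rules; call step() once
    per logged training step. Class and method names are illustrative."""

    def __init__(self, beta: float, lr: float):
        self.beta = beta             # KL coefficient
        self.lr = lr                 # learning rate
        self.rewards: list[float] = []
        self.stopped = False         # caller checkpoints and halts when True

    def step(self, reward: float, entropy: float) -> None:
        self.rewards.append(reward)
        r = self.rewards
        # Plateau: reward delta < 0.01 over the last 15 steps -> beta x1.5, LR x0.5
        if len(r) >= 15 and max(r[-15:]) - min(r[-15:]) < 0.01:
            self.beta *= 1.5
            self.lr *= 0.5
            self.rewards.clear()     # reset the window after intervening
        # Collapse: reward below -1.5 for 8 consecutive steps -> checkpoint + stop
        elif len(r) >= 8 and all(x < -1.5 for x in r[-8:]):
            self.stopped = True
        # Entropy death: warning only, no intervention
        if entropy < 0.5:
            print("WARNING: policy entropy below 0.5")
```

Clearing the reward window after a plateau intervention is a design choice: it prevents the same flat stretch from re-triggering the β and LR updates on every subsequent step.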
## Launch
```shell
pip install trl transformers torch datasets accelerate trackio
MODEL_ID=Qwen/Qwen2.5-3B-Instruct accelerate launch sovereign_grpo_train.py
```
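For orientation, a minimal `sovereign_grpo_train.py` could look like the sketch below, assuming TRL's `GRPOTrainer`. The dataset name, the stub reward function, and the hyperparameter values are placeholders, not the recipe's actual script:

```python
import os
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def combined_reward(completions, **kwargs):
    """Stub reward: TRL calls reward functions with the sampled
    completions; plug the weighted R1..R4 scoring in here."""
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="sovereign-grpo-v1",
    beta=0.04,           # KL coefficient the plateau rule scales by 1.5
    learning_rate=1e-6,  # halved on plateau by the self-healing callback
)

trainer = GRPOTrainer(
    model=os.environ.get("MODEL_ID", "Qwen/Qwen2.5-3B-Instruct"),
    reward_funcs=combined_reward,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
)
trainer.train()
```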