# 🏛️ Sovereign-GRPO-V1 — Multi-Objective Self-Healing GRPO Training Recipe
Four reward signals with decoupled weighting, plus autonomous plateau/collapse recovery:
| Weight | Reward | Correct | Hallucination | Refusal |
|---|---|---|---|---|
| 0.40 | R₁ Correctness | +1.5 | -1.0 | n/a |
| 0.25 | R₂ Logic Density | 0→1.0 | -0.3 | -0.5 |
| 0.30 | R₃ Refusal Calibration | +1.0 | -2.5 | 0.0 |
| 0.05 | R₄ Length Penalty | 0.0 | 0.0 | 0.0 |
Combined weighted reward per outcome (taking R₂ ≈ 0.7 for a typical correct answer): Correct ≈ +1.07, Hallucination ≈ -1.23, Refusal ≈ -0.13
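The table collapses into one scalar per sampled completion. A minimal sketch, assuming per-sample outcome labels and treating R₂'s logic-density score on correct answers as a 0–1 input; the names (`combined_reward`, `WEIGHTS`) and the 0.7 default are illustrative, not the recipe's actual code:

```python
# Decoupled weights for the four reward signals (from the table above).
WEIGHTS = {"R1": 0.40, "R2": 0.25, "R3": 0.30, "R4": 0.05}

def combined_reward(outcome: str, logic_density: float = 0.7) -> float:
    """Weighted sum of the per-objective rewards for one completion.

    `outcome` is one of "correct", "hallucination", "refusal"; on a
    correct answer R2 is a 0-1 logic-density score rather than a constant.
    """
    table = {
        "correct":       {"R1": 1.5,  "R2": logic_density, "R3": 1.0,  "R4": 0.0},
        # R1's refusal reward is "n/a" in the table; modeled here as 0.0.
        "hallucination": {"R1": -1.0, "R2": -0.3,          "R3": -2.5, "R4": 0.0},
        "refusal":       {"R1": 0.0,  "R2": -0.5,          "R3": 0.0,  "R4": 0.0},
    }
    return sum(WEIGHTS[k] * v for k, v in table[outcome].items())
```

With these numbers, `combined_reward("hallucination")` is -1.225 and `combined_reward("correct")` is 1.075, matching the combined totals quoted above.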
## Self-Healing Callback
- Plateau (reward Δ < 0.01 for 15 steps): β × 1.5, LR × 0.5
- Collapse (reward < -1.5 for 8 steps): checkpoint + stop
- Entropy death (< 0.5): warning logged
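The three rules above can be sketched framework-agnostically. The class below and its per-step `step()` hook are illustrative assumptions (in TRL you would wire this logic into a `TrainerCallback`); only the thresholds come from the recipe:

```python
class SelfHealing:
    """Sketch of the plateau/collapse/entropy rules; call step() once
    per logged training step. Class and method names are illustrative."""

    def __init__(self, beta: float, lr: float):
        self.beta = beta             # KL coefficient
        self.lr = lr                 # learning rate
        self.rewards: list[float] = []
        self.stopped = False         # caller checkpoints and halts when True

    def step(self, reward: float, entropy: float) -> None:
        self.rewards.append(reward)
        r = self.rewards
        # Plateau: reward delta < 0.01 over the last 15 steps -> beta x1.5, LR x0.5
        if len(r) >= 15 and max(r[-15:]) - min(r[-15:]) < 0.01:
            self.beta *= 1.5
            self.lr *= 0.5
            self.rewards.clear()     # reset the window after intervening
        # Collapse: reward below -1.5 for 8 consecutive steps -> checkpoint + stop
        elif len(r) >= 8 and all(x < -1.5 for x in r[-8:]):
            self.stopped = True
        # Entropy death: warning only, no intervention
        if entropy < 0.5:
            print("WARNING: policy entropy below 0.5")
```

Clearing the reward window after a plateau intervention is a design choice: it prevents the same flat stretch from re-triggering the β and LR updates on every subsequent step.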
## Launch
```shell
pip install trl transformers torch datasets accelerate trackio
MODEL_ID=Qwen/Qwen2.5-3B-Instruct accelerate launch sovereign_grpo_train.py
```
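For orientation, a minimal `sovereign_grpo_train.py` could look like the sketch below, assuming TRL's `GRPOTrainer`. The dataset name, the stub reward function, and the hyperparameter values are placeholders, not the recipe's actual script:

```python
import os
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def combined_reward(completions, **kwargs):
    """Stub reward: TRL calls reward functions with the sampled
    completions; plug the weighted R1..R4 scoring in here."""
    return [0.0 for _ in completions]

config = GRPOConfig(
    output_dir="sovereign-grpo-v1",
    beta=0.04,           # KL coefficient the plateau rule scales by 1.5
    learning_rate=1e-6,  # halved on plateau by the self-healing callback
)

trainer = GRPOTrainer(
    model=os.environ.get("MODEL_ID", "Qwen/Qwen2.5-3B-Instruct"),
    reward_funcs=combined_reward,
    args=config,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
)
trainer.train()
```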