siddham0909
fix(train): fetch state via env.state(); fix(verifier): under-investigation penalty -1.0 -> -3.0 (unblocks GRPO advantage)
d6b190b verified