fix(train): fetch state via env.state(); fix(verifier): under-investigation penalty -1.0 -> -3.0 (unblocks GRPO advantage) d6b190b verified siddham0909 committed on Apr 26
initial: slim Dockerfile + Gradio UI + env + train pipeline cbab001 verified siddham0909 committed on Apr 25