Run 6: forced-outcome variants + unlikeliness shaping + curriculum, to break the R2-only degenerate policy
Commit 68b2be2 (verified), committed by chane335 on Apr 25