+ continuous batching makes GRPO and RLOO 1.25x faster while using 16 GB less memory + proper MoE post-training across GRPO/RLOO/AsyncGRPO + new GMPO trainer + AsyncGRPO weight sync + padding-free + more
GLM-5.2 is open and competitive with opus 4.8
day-0 in transformers + vllm + sglang, mit license 🤗
on the post-training side: critic-based ppo for variable-length agentic rollouts (ppo is back!) + an online anti-reward-hacking module that feeds the agent dummy info when it tries to cheat
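
for anyone wondering what "critic-based ppo for variable-length rollouts" can look like in practice, here is a minimal sketch. it is plain generalized advantage estimation (GAE) run per rollout, so each trajectory keeps its own length and nothing is padded. this is an assumption about the general technique, not the GLM recipe: `gae_advantages`, `batch_advantages`, `gamma` and `lam` are hypothetical names, and the critic values are just hard-coded numbers.

```python
from typing import List, Sequence, Tuple


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """GAE for a single finished rollout.

    rewards[t] is the reward at step t, values[t] is the critic's estimate
    at step t. The rollout is assumed to terminate after the last step,
    so the bootstrap value past the end is 0 (an assumption of this sketch).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the terminal step
    running = 0.0     # discounted sum of TD errors accumulated so far

    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def batch_advantages(
    rollouts: Sequence[Tuple[Sequence[float], Sequence[float]]],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[List[float]]:
    """Apply GAE to each (rewards, values) rollout independently.

    Rollouts may have different lengths; no padding or masking is needed
    because every rollout is processed on its own.
    """
    return [gae_advantages(r, v, gamma, lam) for r, v in rollouts]


if __name__ == "__main__":
    # Two agentic rollouts of different lengths, reward only at the end.
    short = ([1.0], [0.25])
    long = ([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
    for adv in batch_advantages([short, long], gamma=1.0, lam=1.0):
        print(adv)
```

with `gamma = lam = 1` the advantage at every step reduces to (total return minus the critic's value at that step), which makes the sketch easy to sanity-check by hand. a real trainer would do this on tensors and feed the advantages into the clipped ppo loss.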