Paused | RL SurviveCity v2 - GRPO training | 5-agent zombie-survival GRPO training (Qwen 2.5-3B + LoRA)