The model after instruction tuning, without GRPO. To be updated.