if_step200

Qwen3-8B-Base fine-tuned with GRPO (via slime) on multi-constraint instruction-following data. Training reward = IFEval-G rule checks + a Skywork-Reward-V2-Llama-3.1-8B judge bonus.

This is the iteration 200 checkpoint — the last clean checkpoint while the reward model was healthy (the RM judge went offline around step ~287, after which later checkpoints show reward-hacking on the rule checkers). Converted from Megatron torch_dist to HF format.