if_step200

Qwen3-8B-Base fine-tuned with GRPO (via slime) on multi-constraint instruction-following data. Training reward = IFEval-G rule checks + a Skywork-Reward-V2-Llama-3.1-8B judge bonus.
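A minimal sketch of how a rule-check score and a judge bonus could be combined into one scalar reward. The card does not say how the two terms are weighted or scaled, so the function names, the `judge_weight` parameter, the averaging of rule checks, and the two toy constraints are all assumptions for illustration. They are not the actual IFEval-G checkers or the slime training code.

```python
from typing import Callable, Sequence

# Hypothetical type: a rule check takes a response and returns True if the
# constraint is satisfied.
RuleCheck = Callable[[str], bool]


def rule_score(response: str, checks: Sequence[RuleCheck]) -> float:
    """Fraction of rule checks the response passes (0.0 if there are none)."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


def combined_reward(
    response: str,
    checks: Sequence[RuleCheck],
    judge_score: float,
    judge_weight: float = 0.5,  # assumed weighting, not taken from the card
) -> float:
    """Rule-check score plus a weighted judge bonus.

    `judge_score` stands in for a scalar produced by a reward model; its
    scale here is an assumption.
    """
    return rule_score(response, checks) + judge_weight * judge_score


# Two toy constraints standing in for multi-constraint instructions.
toy_checks = [
    lambda text: len(text.split()) <= 10,     # at most 10 words
    lambda text: text.strip().endswith("."),  # ends with a period
]

reward = combined_reward(
    "Short answer that ends with a period.",
    toy_checks,
    judge_score=0.8,
)
```

If the judge term drops out (for example, `judge_weight=0.0`), the reward reduces to the rule checks alone. In that setting a policy can score well by satisfying the checkers without producing good responses.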

This is the iteration-200 checkpoint, the last clean checkpoint from while the reward model was healthy. The RM judge went offline around step 287, and later checkpoints show reward hacking on the rule checkers. The checkpoint was converted from Megatron torch_dist to HF format.

Safetensors · Model size: 8B params · Tensor type: BF16

Model repository: arlarena/if_step200 (fine-tuned from Qwen3-8B-Base)