ICYMI, you can fine-tune open LLMs using Claude Code
just tell it: "Fine-tune Qwen3-0.6B on open-r1/codeforces-cots"
and Claude submits a real training job on HF GPUs using TRL.
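under the hood, the submitted job is a plain TRL script. a minimal sketch of what it might look like (the output name and the "solutions" subset are assumptions here; check the dataset card):

```python
# pip install trl datasets
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# reasoning-trace dataset named in the post; "solutions" is assumed
# to be the subset to train on
dataset = load_dataset("open-r1/codeforces-cots", "solutions", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",  # model named in the post
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-0.6b-codeforces-cots",
        push_to_hub=True,  # puts the finished model on the Hub
    ),
)
trainer.train()
```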
it handles everything:
> dataset validation
> GPU selection
> training + Trackio monitoring
> job submission + cost estimation
when it's done, your model is on the Hub, ready to use
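prefer to submit by hand? the same kind of job goes through the `hf jobs` CLI from huggingface_hub. a sketch, assuming a recent huggingface_hub and that the a10g-large flavor fits your budget:

```bash
# run the training script on HF GPUs; --flavor picks the hardware
hf jobs uv run --flavor a10g-large train.py

# list your jobs and check their status
hf jobs ps
```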
TRL includes GDPO, the latest variant of GRPO for multi-reward RL ✨
GDPO decouples reward normalization to avoid reward collapse and improve per-reward convergence, developed by @sliuau, @SimonX, et al.
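a sketch of multi-reward RL with TRL's GRPOTrainer. the reward functions are toys, the dataset is the one from TRL's GRPO docs, and `loss_type="gdpo"` is an assumption about how TRL exposes GDPO; the exact switch may differ in your version:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# two toy rewards; plain GRPO normalizes their combined signal jointly,
# while GDPO (per the post) normalizes each reward separately
def reward_brevity(completions, **kwargs):
    return [-len(c) / 1000 for c in completions]

def reward_mentions_code(completions, **kwargs):
    return [1.0 if "def " in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # prompt-only dataset

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=[reward_brevity, reward_mentions_code],  # multi-reward RL
    args=GRPOConfig(
        output_dir="qwen3-0.6b-gdpo",
        loss_type="gdpo",  # assumption: the GDPO switch's name may differ
    ),
    train_dataset=dataset,
)
trainer.train()
```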