Update app/src/content/article.mdx
app/src/content/article.mdx
@@ -288,7 +288,7 @@ Having shown that GOLD handles tokenizer differences effectively, we now benchma
## On-policy distillation outperforms GRPO
-On-policy distillation uses student-generated completions to progressively update the training data. Having established this approach is superior to offline methods like SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [shao2024deepseekmathpushinglimitsmathematical] and later popularized by the Deepseek R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].
+On-policy distillation uses student-generated completions to progressively update the training data. Having established that this approach is superior to offline methods such as SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [@shao2024deepseekmathpushinglimitsmathematical] and later popularized by the DeepSeek-R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].
We followed [Philipp Schmid’s tutorial](https://www.philschmid.de/mini-deepseek-r1) on training a model with GRPO on the Countdown task and compared its performance to that of knowledge distillation (KD). Our reward function was the sum of three components:
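For context on the GRPO baseline: GRPO scores each sampled completion with a reward like the one above, then computes advantages by normalizing rewards within the group of completions sampled for the same prompt (subtracting the group mean and dividing by the group standard deviation), rather than using a learned value baseline. A minimal sketch of that group-relative normalization, illustrative only and not code from the tutorial or the article:

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward by the
    mean and standard deviation of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions earned the same reward: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Example: four completions sampled for one prompt, two of which
# earned the full reward and two of which earned nothing.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are penalized, which is why GRPO needs several samples per prompt but no separate critic model.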