sergiopaniego (HF Staff) committed
Commit ffb28aa · verified · Parent: d56e170

Update app/src/content/article.mdx

Files changed (1): app/src/content/article.mdx (+1 −1)
app/src/content/article.mdx CHANGED
```diff
@@ -288,7 +288,7 @@ Having shown that GOLD handles tokenizer differences effectively, we now benchma
 
 ## On-policy distillation outperforms GRPO
 
-On-policy distillation uses student-generated completions to progressively update the training data. Having established this approach is superior to offline methods like SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [shao2024deepseekmathpushinglimitsmathematical] and later popularized by the Deepseek R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].
+On-policy distillation uses student-generated completions to progressively update the training data. Having established this approach is superior to offline methods like SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [@shao2024deepseekmathpushinglimitsmathematical] and later popularized by the Deepseek R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].
 
 We followed [Philipp Schmid’s tutorial](https://www.philschmid.de/mini-deepseek-r1) of how to train GRPO for the Countdown task and compared it to the performance of KD distillation. Our reward function was a sum of three components:
 
```