Update app/src/content/article.mdx
app/src/content/article.mdx
@@ -288,7 +288,7 @@ Having shown that GOLD handles tokenizer differences effectively, we now benchma
## On-policy distillation outperforms GRPO
-On-policy distillation uses student-generated completions to progressively update the training data. Having established this approach is superior to offline methods like SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [shao2024deepseekmathpushinglimitsmathematical] and later popularized by the Deepseek R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].
+On-policy distillation uses student-generated completions to progressively update the training data. Having established that this approach is superior to offline methods such as SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [@shao2024deepseekmathpushinglimitsmathematical] and later popularized by the DeepSeek-R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].
We followed [Philipp Schmid’s tutorial](https://www.philschmid.de/mini-deepseek-r1) on training a model with GRPO on the Countdown task and compared its performance to that of knowledge distillation (KD). Our reward function was the sum of three components:
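For context on the GRPO baseline: GRPO scores each sampled completion with a reward like the one above, then computes advantages by normalizing rewards within the group of completions sampled for the same prompt (subtracting the group mean and dividing by the group standard deviation), rather than using a learned value baseline. A minimal sketch of that group-relative normalization, illustrative only and not code from the tutorial or the article:

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward by the
    mean and standard deviation of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions earned the same reward: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Example: four completions sampled for one prompt, two of which
# earned the full reward and two of which earned nothing.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are penalized, which is why GRPO needs several samples per prompt but no separate critic model.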