sergiopaniego HF Staff committed on
Commit d56e170 · verified · 1 Parent(s): 899c610

Update app/src/content/article.mdx

Files changed (1)
  1. app/src/content/article.mdx +1 -1
app/src/content/article.mdx CHANGED
@@ -233,7 +233,7 @@ The tables also show that the tokenizer between Qwen2.5 and Qwen3 versions diffe
 
  ### GKD with the Same Tokenizer
 
- Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5 We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
+ Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
 
  The results confirm that using at least some degree of on-policy training outperforms the SFT setup. We also see a trend of better performance as we increase $\lambda$, with fully on-policy achieving the best overall performance. This behavior confirms the hypothesis that fully on-policy training is better than training with offline data when using models with the same tokenizer.
 