kashif (HF Staff) committed on
Commit 1473b99 · 1 Parent(s): 12fed3e

remove unrendered link

Files changed (1)
  1. app/src/content/article.mdx +2 -2
app/src/content/article.mdx CHANGED
@@ -130,7 +130,7 @@ The main limitation with all on-policy distillation methods is that they assume
 downloadable
 loading="lazy"
 alt="ULD Diagram"
-caption={'Figure 1: Previous work, ULD by Boizard et al. [@boizard2025crosstokenizerdistillationuniversallogit]. Demonstrate offline distillation on student and teacher models with unmatched tokenizers. GOLD extends their method to the on-policy setting and addresses two weaknesses: token alignment in step 3 and logit alignment in step 4.'}
+caption={'Figure 1: Previous work, ULD by Boizard et al. demonstrates offline distillation on student and teacher models with unmatched tokenizers. GOLD extends their method to the on-policy setting and addresses two weaknesses: token alignment in step 3 and logit alignment in step 4.'}
 />
 
 ULD showed that using distillation between models with different tokenizers introduces two key challenges:
@@ -273,7 +273,7 @@ The tables also show that the tokenizer between Qwen2.5 and Qwen3 versions diffe
 
 ### GKD with the Same Tokenizer
 
-Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
+Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al. [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
 
 The results confirm that using at least some degree of on-policy training outperforms the SFT setup. We also see a trend of better performance as we increase $\lambda$, with fully on-policy achieving the best overall performance. This behavior confirms the hypothesis that fully on-policy training is better than training with offline data when using models with the same tokenizer.
 
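
For context on the changed paragraph at line 276, here is a minimal sketch of the GKD update it describes: with probability $\lambda$ the batch is distilled on student self-generations (on-policy), otherwise it falls back to the pre-generated teacher completions, and the loss is the forward KL ($\beta=0$). The `student`/`teacher` modules and the `gkd_step` helper are hypothetical stand-ins, not TRL's `GKDTrainer` API.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    # Forward KL, KL(p_teacher || p_student): the beta = 0 case of the
    # distillation divergence in L_OD.
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), averaged over positions.
    return (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1).mean()

def gkd_step(student, teacher, prompt_ids, offline_completion_ids, lam):
    # lam = 1.0 is fully on-policy; lam = 0.0 trains only on offline data.
    if torch.rand(()).item() < lam:
        # On-policy branch: sample completions from the student at gamma = 1
        # (hypothetical sampler that returns completion ids only).
        completion_ids = student.generate(prompt_ids)
    else:
        # Offline branch: reuse completions generated by the teacher beforehand.
        completion_ids = offline_completion_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
    student_logits = student(input_ids).logits
    with torch.no_grad():  # the teacher stays frozen
        teacher_logits = teacher(input_ids).logits
    return forward_kl(teacher_logits, student_logits)
```

In practice the divergence would be masked to completion tokens only; with $\lambda=1$ this reduces to the fully on-policy setting that performed best in the ablation.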