remove unrendered link
app/src/content/article.mdx
@@ -130,7 +130,7 @@ The main limitation with all on-policy distillation methods is that they assume
 downloadable
 loading="lazy"
 alt="ULD Diagram"
-caption={'Figure 1: Previous work, ULD by Boizard et al.
+caption={'Figure 1: Previous work, ULD by Boizard et al. demonstrates offline distillation on student and teacher models with unmatched tokenizers. GOLD extends their method to the on-policy setting and addresses two weaknesses: token alignment in step 3 and logit alignment in step 4.'}
 />

 ULD showed that using distillation between models with different tokenizers introduces two key challenges:
@@ -273,7 +273,7 @@ The tables also show that the tokenizer between Qwen2.5 and Qwen3 versions diffe

 ### GKD with the Same Tokenizer

-Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
+Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al. [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.

 The results confirm that using at least some degree of on-policy training outperforms the SFT setup. We also see a trend of better performance as we increase $\lambda$, with fully on-policy achieving the best overall performance. This behavior confirms the hypothesis that fully on-policy training is better than training with offline data when using models with the same tokenizer.
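For context on the paragraph amended above, here is a minimal sketch of a $\lambda$-mixed GKD step with the forward KL ($\beta=0$) objective it describes. This is an illustration under assumptions, not the article's actual training code: `student`, `teacher`, their `generate`/`logits` methods, and the call signatures are hypothetical placeholders.

```python
import random

import torch
import torch.nn.functional as F


def forward_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocab axis; logits are (batch, seq, vocab).

    Only meaningful when both models share a tokenizer, so the vocab
    dimensions line up token for token.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input); with log_target=True
    # the target is also passed as log-probabilities.
    return F.kl_div(student_logp, teacher_logp, reduction="batchmean", log_target=True)


def gkd_step(student, teacher, prompts, offline_completions, lam: float, temperature: float = 1.0):
    """One distillation step: with probability `lam`, score completions sampled
    from the student (on-policy); otherwise, score completions pre-generated by
    the teacher (offline)."""
    if random.random() < lam:
        # On-policy branch: the student samples its own completions at temperature gamma.
        completions = student.generate(prompts, do_sample=True, temperature=temperature)
    else:
        # Off-policy branch: reuse teacher completions generated beforehand.
        completions = offline_completions

    student_logits = student.logits(prompts, completions)      # gradients flow here
    with torch.no_grad():
        teacher_logits = teacher.logits(prompts, completions)  # teacher stays frozen

    loss = forward_kl(student_logits, teacher_logits)
    loss.backward()
    return loss
```

With $\lambda=0$ every batch uses the offline teacher data (pure off-policy), and with $\lambda=1$ every batch is scored on the student's own samples, the fully on-policy setting that performs best in the ablation.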