remove unrendered link
app/src/content/article.mdx
@@ -130,7 +130,7 @@ The main limitation with all on-policy distillation methods is that they assume
 downloadable
 loading="lazy"
 alt="ULD Diagram"
-caption={'Figure 1: Previous work, ULD by Boizard et al.
+caption={'Figure 1: Previous work, ULD by Boizard et al. demonstrates offline distillation on student and teacher models with unmatched tokenizers. GOLD extends their method to the on-policy setting and addresses two weaknesses: token alignment in step 3 and logit alignment in step 4.'}
 />

 ULD showed that using distillation between models with different tokenizers introduces two key challenges:
@@ -273,7 +273,7 @@ The tables also show that the tokenizer between Qwen2.5 and Qwen3 versions diffe

 ### GKD with the Same Tokenizer

-Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
+Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al. [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.

 The results confirm that using at least some degree of on-policy training outperforms the SFT setup. We also see a trend of better performance as we increase $\lambda$, with fully on-policy achieving the best overall performance. This behavior confirms the hypothesis that fully on-policy training is better than training with offline data when using models with the same tokenizer.
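For context on the paragraph amended above, here is a minimal sketch of a $\lambda$-mixed GKD step with the forward KL ($\beta=0$) objective it describes. This is an illustration under assumptions, not the article's actual training code: `student`, `teacher`, their `generate`/`logits` methods, and the call signatures are hypothetical placeholders.

```python
import random

import torch
import torch.nn.functional as F


def forward_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocab axis; logits are (batch, seq, vocab).

    Only meaningful when both models share a tokenizer, so the vocab
    dimensions line up token for token.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input); with log_target=True
    # the target is also passed as log-probabilities.
    return F.kl_div(student_logp, teacher_logp, reduction="batchmean", log_target=True)


def gkd_step(student, teacher, prompts, offline_completions, lam: float, temperature: float = 1.0):
    """One distillation step: with probability `lam`, score completions sampled
    from the student (on-policy); otherwise, score completions pre-generated by
    the teacher (offline)."""
    if random.random() < lam:
        # On-policy branch: the student samples its own completions at temperature gamma.
        completions = student.generate(prompts, do_sample=True, temperature=temperature)
    else:
        # Off-policy branch: reuse teacher completions generated beforehand.
        completions = offline_completions

    student_logits = student.logits(prompts, completions)      # gradients flow here
    with torch.no_grad():
        teacher_logits = teacher.logits(prompts, completions)  # teacher stays frozen

    loss = forward_kl(student_logits, teacher_logits)
    loss.backward()
    return loss
```

With $\lambda=0$ every batch uses the offline teacher data (pure off-policy), and with $\lambda=1$ every batch is scored on the student's own samples, the fully on-policy setting that performs best in the ablation.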