General On-Policy Logit Distillation
ULD lifts the tokenizer restriction but remains limited to offline setups. Next, we introduce our core contribution — **General On-Policy Logit Distillation (GOLD)** — which extends ULD into the on-policy setting with improved alignment techniques.
## General On-Policy Logit Distillation (GOLD)
While Universal Logit Distillation (ULD) allows training models with different tokenizers, its methods for sequence and vocabulary alignment have limitations. We developed General On-Policy Logit Distillation (GOLD), an algorithm that extends ULD to the on-policy setting and introduces improved sequence and vocabulary alignment techniques.
### Sequence Alignment
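One way to picture the token-merging idea: when the teacher's and student's tokenizations of the same text disagree, consecutive tokens on the shorter side are merged until the character spans line up, and their log probabilities are summed (since log p(x) + log p(y) = log p(x, y)). A minimal sketch in plain Python; the function and variable names are illustrative, not taken from the GOLD implementation:

```python
def merge_logprobs(tokens_a, logps_a, tokens_b, logps_b):
    """Merge two tokenizations of the same text until their character
    spans line up, summing log probabilities within each merged group
    (log p(x) + log p(y) = log p(x, y)).

    Illustrative names only; assumes both token lists concatenate to
    the same string.
    """
    merged_a, merged_b = [], []
    i = j = 0
    while i < len(tokens_a) and j < len(tokens_b):
        span_a, lp_a = tokens_a[i], logps_a[i]
        span_b, lp_b = tokens_b[j], logps_b[j]
        i, j = i + 1, j + 1
        # Grow the shorter span until both cover the same characters.
        while len(span_a) != len(span_b):
            if len(span_a) < len(span_b):
                span_a += tokens_a[i]
                lp_a += logps_a[i]  # summing log probs = multiplying probs
                i += 1
            else:
                span_b += tokens_b[j]
                lp_b += logps_b[j]
                j += 1
        merged_a.append((span_a, lp_a))
        merged_b.append((span_b, lp_b))
    return merged_a, merged_b
```

For example, `merge_logprobs(["hel", "lo"], [-1.0, -0.5], ["hello"], [-1.2])` merges "hel" and "lo" into "hello" with summed log probability -1.5, so both sides end up with one aligned span.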
## Conclusion
In this post, we introduced General On-Policy Logit Distillation (GOLD), a new method that enables effective on-policy knowledge distillation between models, even when the teacher and student do not share the same tokenizer vocabulary. This overcomes a significant limitation of existing on-policy methods like GKD, which require matched tokenizers.
GOLD builds upon the offline ULD method but extends it to the on-policy setting and, critically, addresses its two main weaknesses. First, we replace ULD's naive sequence truncation with a token-merging strategy that sums log probabilities of mismatched tokens. Second, we implement a hybrid vocabulary alignment method that uses a direct-mapping loss for shared tokens and falls back to ULD's sorting method only for unmatched tokens.
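The hybrid vocabulary alignment can be sketched roughly as follows, in plain Python for a single position. The function name, the index-mapping arguments, and the use of an L1 distance over sorted probabilities as the ULD-style fallback are assumptions for illustration, not the actual GOLD implementation:

```python
import math

def hybrid_vocab_loss(student_logits, teacher_logits, shared_s, shared_t):
    """Sketch of a hybrid vocabulary-alignment loss for one position.

    shared_s / shared_t are parallel index lists mapping tokens present
    in both vocabularies (hypothetical names). Shared tokens get a
    direct-mapping term; remaining tokens fall back to a ULD-style
    comparison of sorted probabilities.
    """
    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        return [e / z for e in exps]

    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)

    # Direct mapping: compare probabilities of tokens shared by both vocabs.
    direct = sum(abs(p_t[j] - p_s[i]) for i, j in zip(shared_s, shared_t))

    # ULD-style fallback: sort the unmatched probabilities on each side
    # and compare them position-wise.
    rest_s = sorted((p for i, p in enumerate(p_s) if i not in set(shared_s)),
                    reverse=True)
    rest_t = sorted((p for j, p in enumerate(p_t) if j not in set(shared_t)),
                    reverse=True)
    k = min(len(rest_s), len(rest_t))
    fallback = sum(abs(a - b) for a, b in zip(rest_t[:k], rest_s[:k]))

    return direct + fallback
```

When the two vocabularies coincide and every token is mapped, the fallback term vanishes and the loss reduces to a plain distance between the two distributions.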