General On-Policy Logit Distillation
ULD lifts the tokenizer restriction but remains limited to offline setups. Next, we introduce our core contribution — **General On-Policy Logit Distillation (GOLD)** — which extends ULD into the on-policy setting with improved alignment techniques.
## General On-Policy Logit Distillation (GOLD)
While Universal Logit Distillation (ULD) allows training models with different tokenizers, its methods for sequence and vocabulary alignment have limitations. We developed General On-Policy Logit Distillation (GOLD), an algorithm that extends ULD to the on-policy setting and introduces improved sequence and vocabulary alignment techniques.
### Sequence Alignment
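One way to picture the token-merging idea: when the teacher's and student's tokenizations of the same text disagree, consecutive tokens on the shorter side are merged until the character spans line up, and their log probabilities are summed (since log p(x) + log p(y) = log p(x, y)). A minimal sketch in plain Python; the function and variable names are illustrative, not taken from the GOLD implementation:

```python
def merge_logprobs(tokens_a, logps_a, tokens_b, logps_b):
    """Merge two tokenizations of the same text until their character
    spans line up, summing log probabilities within each merged group
    (log p(x) + log p(y) = log p(x, y)).

    Illustrative names only; assumes both token lists concatenate to
    the same string.
    """
    merged_a, merged_b = [], []
    i = j = 0
    while i < len(tokens_a) and j < len(tokens_b):
        span_a, lp_a = tokens_a[i], logps_a[i]
        span_b, lp_b = tokens_b[j], logps_b[j]
        i, j = i + 1, j + 1
        # Grow the shorter span until both cover the same characters.
        while len(span_a) != len(span_b):
            if len(span_a) < len(span_b):
                span_a += tokens_a[i]
                lp_a += logps_a[i]  # summing log probs = multiplying probs
                i += 1
            else:
                span_b += tokens_b[j]
                lp_b += logps_b[j]
                j += 1
        merged_a.append((span_a, lp_a))
        merged_b.append((span_b, lp_b))
    return merged_a, merged_b
```

For example, `merge_logprobs(["hel", "lo"], [-1.0, -0.5], ["hello"], [-1.2])` merges "hel" and "lo" into "hello" with summed log probability -1.5, so both sides end up with one aligned span.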
## Conclusion
In this post, we introduced General On-Policy Logit Distillation (GOLD), a new method that enables effective on-policy knowledge distillation between models, even when the teacher and student do not share the same tokenizer vocabulary. This overcomes a significant limitation of existing on-policy methods like GKD, which require matched tokenizers.
GOLD builds upon the offline ULD method but extends it to the on-policy setting and, critically, addresses its two main weaknesses. First, we replace ULD's naive sequence truncation with a token-merging strategy that sums log probabilities of mismatched tokens. Second, we implement a hybrid vocabulary alignment method that uses a direct-mapping loss for shared tokens and falls back to ULD's sorting method only for unmatched tokens.
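The hybrid vocabulary alignment can be sketched roughly as follows, in plain Python for a single position. The function name, the index-mapping arguments, and the use of an L1 distance over sorted probabilities as the ULD-style fallback are assumptions for illustration, not the actual GOLD implementation:

```python
import math

def hybrid_vocab_loss(student_logits, teacher_logits, shared_s, shared_t):
    """Sketch of a hybrid vocabulary-alignment loss for one position.

    shared_s / shared_t are parallel index lists mapping tokens present
    in both vocabularies (hypothetical names). Shared tokens get a
    direct-mapping term; remaining tokens fall back to a ULD-style
    comparison of sorted probabilities.
    """
    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        return [e / z for e in exps]

    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)

    # Direct mapping: compare probabilities of tokens shared by both vocabs.
    direct = sum(abs(p_t[j] - p_s[i]) for i, j in zip(shared_s, shared_t))

    # ULD-style fallback: sort the unmatched probabilities on each side
    # and compare them position-wise.
    rest_s = sorted((p for i, p in enumerate(p_s) if i not in set(shared_s)),
                    reverse=True)
    rest_t = sorted((p for j, p in enumerate(p_t) if j not in set(shared_t)),
                    reverse=True)
    k = min(len(rest_s), len(rest_t))
    fallback = sum(abs(a - b) for a, b in zip(rest_t[:k], rest_s[:k]))

    return direct + fallback
```

When the two vocabularies coincide and every token is mapped, the fallback term vanishes and the loss reduces to a plain distance between the two distributions.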