Merge branch 'main' of https://huggingface.co/spaces/HuggingFaceH4/general-on-policy-logit-distillation
app/src/content/article.mdx
CHANGED
@@ -124,9 +124,9 @@ Figure 2: Diagram of sequence and vocabulary misalignments caused by differences
 
 ULD lifts the tokenizer restriction but remains limited to offline setups. Next, we introduce our core contribution — **General On-Policy Logit Distillation (GOLD)** — which extends ULD into the on-policy setting with improved alignment techniques.
 
-## General On-
+## General On-Policy Logit Distillation (GOLD)
 
-While Universal Logit Distillation (ULD) allows training models with different tokenizers, its methods for sequence and vocabulary alignment have limitations. We developed General
+While Universal Logit Distillation (ULD) allows training models with different tokenizers, its methods for sequence and vocabulary alignment have limitations. We developed General On-Policy Logit Distillation (GOLD), an algorithm that extends ULD by introducing improved vocabulary alignment techniques.
 
 ### Sequence Alignment
 
@@ -385,7 +385,7 @@ accelerate launch \
 
 ```bash
 accelerate launch \
-    --config_file examples/accelerate_configs/multi_gpu.yaml trl/
+    --config_file examples/accelerate_configs/multi_gpu.yaml trl/experimental/gold/gold.py \
     --model_name_or_path <sft-model> \
     --dtype auto \
     --attn_implementation kernels-community/flash-attn \
 
@@ -428,7 +428,7 @@ accelerate launch \
 
 ## Conclusion
 
-In this post, we introduced General On-
+In this post, we introduced General On-Policy Logit Distillation (GOLD), a new method that enables effective on-policy knowledge distillation between models, even when the teacher and student do not share the same tokenizer vocabulary. This overcomes a significant limitation of existing on-policy methods like GKD, which require matched tokenizers.
 
 GOLD builds upon the offline ULD method but extends it to the on-policy setting and, critically, addresses its two main weaknesses. First, we replace ULD's naive sequence truncation with a token-merging strategy that sums log probabilities of mismatched tokens. Second, we implement a hybrid vocabulary alignment method that uses a direct-mapping loss for shared tokens and falls back to ULD's sorting method only for unmatched tokens.
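The token-merging strategy described in the conclusion rests on a simple identity: the probability of a merged span is the product of its pieces, so log-probabilities add. A minimal sketch of that idea follows; the function name and the greedy character-length alignment heuristic are illustrative assumptions, not the actual GOLD/TRL implementation.

```python
# Sketch of token merging for sequence alignment: when several student
# tokens span one teacher token, their log-probs are summed so that
# log P("token") = log P("tok") + log P("en").
# The greedy character-length matching below is an assumption for
# illustration, not the actual GOLD/TRL code.

def merge_logprobs(student_tokens, student_logprobs, teacher_tokens):
    """Greedily group student tokens to cover each teacher token,
    summing the log-probs of every group into a single value."""
    merged = []
    i = 0
    for t_tok in teacher_tokens:
        acc, span = 0.0, ""
        while i < len(student_tokens) and len(span) < len(t_tok):
            acc += student_logprobs[i]
            span += student_tokens[i]
            i += 1
        merged.append(acc)
    return merged

# Teacher splits the text as ["token", "ization"]; the student splits it
# as ["tok", "en", "ization"], so its first two log-probs are merged.
out = merge_logprobs(
    ["tok", "en", "ization"], [-0.5, -0.25, -1.0], ["token", "ization"]
)
print(out)  # [-0.75, -1.0]
```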
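The hybrid vocabulary alignment from the conclusion can likewise be sketched as two terms: a direct comparison of probabilities for tokens present in both vocabularies, plus a ULD-style comparison of sorted probabilities for the unmatched remainder. The L1 loss form and the dict-based interface here are assumptions for illustration only.

```python
# Sketch of hybrid vocabulary alignment: shared tokens get a
# direct-mapping loss, unmatched tokens fall back to ULD's
# sorted-probability comparison. L1 distance and the dict interface
# are illustrative assumptions, not the TRL implementation.

def hybrid_vocab_loss(teacher_probs, student_probs):
    """teacher_probs / student_probs: dicts mapping token -> probability."""
    shared = teacher_probs.keys() & student_probs.keys()
    # Direct-mapping term: compare shared tokens one-to-one.
    direct = sum(abs(teacher_probs[t] - student_probs[t]) for t in shared)
    # ULD-style fallback: sort the unmatched probabilities descending and
    # compare rank by rank, padding the shorter list with zeros.
    t_rest = sorted((p for t, p in teacher_probs.items() if t not in shared),
                    reverse=True)
    s_rest = sorted((p for t, p in student_probs.items() if t not in shared),
                    reverse=True)
    n = max(len(t_rest), len(s_rest))
    t_rest += [0.0] * (n - len(t_rest))
    s_rest += [0.0] * (n - len(s_rest))
    fallback = sum(abs(a - b) for a, b in zip(t_rest, s_rest))
    return direct + fallback

teacher = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
student = {"cat": 0.4, "dog": 0.4, "bird": 0.2}
print(hybrid_vocab_loss(teacher, student))  # ≈ 0.2
```

With the toy distributions above, "cat" and "dog" are compared directly (contributing 0.1 + 0.1), while "fish" and "bird" are unmatched and compared by sorted rank (contributing 0), so the loss is approximately 0.2.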