---
title: "Unlocking On-Policy<br/> Distillation for Any Model Family"
subtitle: "Apply on-policy distillation to models from different families"
description: "Unlocking On-Policy Distillation for Any Model Family"
authors:
  - name: Carlos Miguel Patiño
    url: 'https://huggingface.co/cmpatino'
    affiliations:
      - 1
  - name: Kashif Rasul
    url: 'https://huggingface.co/kashif'
    affiliations:
      - 1
  - name: Quentin Gallouédec
    url: 'https://huggingface.co/qgallouedec'
    affiliations:
      - 1
  - name: Ben Burtenshaw
    url: 'https://huggingface.co/burtenshaw'
    affiliations:
      - 1
  - name: Sergio Paniego
    url: 'https://huggingface.co/sergiopaniego'
    affiliations:
      - 1
  - name: Vaibhav Srivastav
    url: 'https://huggingface.co/reach-vb'
    affiliations:
      - 1
  - name: Thibaud Frere
    url: 'https://huggingface.co/tfrere'
    affiliations:
      - 1
  - name: Ed Beeching
    url: 'https://huggingface.co/edbeeching'
    affiliations:
      - 1
  - name: Lewis Tunstall
    url: 'https://huggingface.co/lewtun'
    affiliations:
      - 1
  - name: Leandro von Werra
    url: 'https://huggingface.co/lvwerra'
    affiliations:
      - 1
  - name: Thomas Wolf
    url: 'https://huggingface.co/thomwolf'
    affiliations:
      - 1
affiliations:
  - name: "Hugging Face"
    url: "https://huggingface.co"
published: "Oct. 29, 2025"
tags:
  - research
tableOfContentsAutoCollapse: true
pdfProOnly: false
---

import HtmlEmbed from '../components/HtmlEmbed.astro'
import Image from '../components/Image.astro'
import Accordion from '../components/Accordion.astro'
import qwenTable from './assets/image/qwen-table.png'
import uldDiagram from './assets/image/uld-diagram.png'
import tokenizationDiagram from './assets/image/tokenization-diagram.png'
import sequenceAlignment from './assets/image/sequence-alignment.png'
import vocabAlignment from './assets/image/vocab-alignment.png'
import domainDistillationImage from './assets/image/thinking-machines-table.png'

## Introduction
On-policy distillation is a highly effective strategy for compressing LLMs, as recently highlighted by [Thinking Machines' excellent blog post](https://thinkingmachines.ai/blog/on-policy-distillation/). The technique trains a small "student" model by transferring knowledge from a high-performing "teacher" model's probability distribution. This allows the student to emulate the teacher's task performance, while significantly reducing size and latency.

In this blog post, we introduce **General On-Policy Logit Distillation (GOLD)**, our method for extending on-policy distillation to address a fundamental weakness: the requirement that the teacher and student models must share the *same* tokenizer vocabulary.

Building on Universal Logit Distillation (ULD) [@boizard2025crosstokenizerdistillationuniversallogit], GOLD is highly effective for complex, multi-step reasoning tasks, such as math. Our results show GOLD performs better than ULD and even GRPO.

Our key contributions are:

- Providing an open-source implementation of on-policy distillation methods in TRL ([GKD](https://huggingface.co/docs/trl/en/gkd_trainer) and [GOLD](https://huggingface.co/docs/trl/main/en/gold_trainer)) and proving they work for multiple model combinations.
- Extending ULD to the on-policy setting, where we sample completions from the student and align them to the teacher's distribution.
- Implementing new sequence and vocabulary alignment methods that improve distillation performance when the student and the teacher have different tokenizers.

With this foundation in place, let’s step back to review the broader landscape of knowledge distillation methods - how on-policy approaches emerged, and why extending them beyond shared tokenizers is critical.

## Distillation Methods

### Off-policy vs. on-policy distillation

There are two main types of distillation: off-policy and on-policy. Off-policy distillation trains a student model on fixed data (typically the teacher's precomputed logits or text completions), while on-policy distillation involves the teacher providing feedback on the student's own outputs.

Generalised Knowledge Distillation (GKD) [@agarwal2024onpolicydistillationlanguagemodels] unifies these approaches under a common framework by supporting a range of loss functions that enable training on both static teacher data and trajectories generated by the student. The GKD paper shows that on-policy distillation typically outperforms off-policy methods: a result we confirm later in this post.

On-policy distillation's advantage is twofold. First, as the student model improves, its generations create progressively higher-quality training data, forming a positive feedback loop. Second, this “context alignment” forces the student to learn from the same types of errors and successes it will encounter during inference, rather than from completions generated only by the teacher.

GKD controls this on-policy vs. off-policy data mixture via the $\lambda$ parameter, where $\lambda=1$ is fully on-policy and $\lambda=0$ is fully offline, as shown in the equation below:

$$
\mathcal{L}_{GKD} = (1 - \lambda) \mathcal{L}_{SD} + \lambda \mathcal{L}_{OD}
$$

where $\mathcal{L}_{SD}$ is the supervised distillation (SD) that leverages off-policy generations from the teacher and $\mathcal{L}_{OD}$ is the on-policy distillation (OD) using student generations and feedback from the teacher’s logits[^f1].
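
Concretely, the mixture amounts to deciding, per batch, whether the completion comes from the student or from pre-generated teacher data. Below is a minimal sketch of one such training step, assuming a shared tokenizer and a `batch` with `prompt_ids` and teacher `completion_ids` (hypothetical field names), and ignoring the padding, masking, and generalized JSD details that TRL's `GKDTrainer` handles for you:

```python
import random
import torch
import torch.nn.functional as F

def gkd_step(student, teacher, batch, lmbda=0.5, max_new_tokens=256):
    """One GKD update mixing off-policy (teacher data) and on-policy (student generations)."""
    if random.random() < lmbda:
        # On-policy branch: the student generates its own completion for the prompt
        sequence_ids = student.generate(
            batch["prompt_ids"], max_new_tokens=max_new_tokens, do_sample=True, temperature=1.0
        )
    else:
        # Off-policy branch: reuse completions pre-generated by the teacher
        sequence_ids = torch.cat([batch["prompt_ids"], batch["completion_ids"]], dim=1)

    student_logits = student(sequence_ids).logits
    with torch.no_grad():
        teacher_logits = teacher(sequence_ids).logits

    # Forward KL (beta=0): push the student's per-token distribution toward the teacher's
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    loss.backward()
    return loss
```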

When compared to RL, GKD also has two main benefits: 

1. We don’t need to rely on a reward function that gives sparse feedback.
2. The method works for small models that initially have low performance on the task we're trying to optimise for.

The reward function requires either a verifiable task or training a reward model to score the completion, and it only gives feedback about the outcome. There is no explicit information about which parts of the process were correct and which require adjustment.

On-policy distillation overcomes this limitation by providing feedback from a strong teacher at the *token level*. This approach is especially effective for smaller models, as demonstrated in the Qwen3 [@yang2025qwen3technicalreport] results below, where on-policy distillation outperforms RL at a fraction of the compute budget:

<Image
  src={qwenTable}
  layout="fixed"
  zoomable
  loading="lazy"
  alt="Qwen3's Table"
/>

While GKD establishes a strong foundation for on-policy training, it assumes both models share a tokenizer, a practical constraint we’ll now address through Universal Logit Distillation (ULD).

### Universal logit distillation

The main limitation with all on-policy distillation methods is that they assume the use of the same tokenizer for both the student and the teacher. The current AI ecosystem spans different model families such as [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm3), [Llama](https://huggingface.co/collections/meta-llama/llama-32), [Qwen](https://huggingface.co/collections/Qwen/qwen3), and [Gemma](https://huggingface.co/collections/google/gemma-3-release), each with their own strengths and shortcomings. Each model family, and even different versions within the same family, uses its own tokenizer, so requiring a single tokenizer can be overly restrictive when selecting student-teacher pairings. Recent work, such as **Universal Logit Distillation (ULD),** lifts the tokenizer restriction by showing distillation can be performed without needing a perfect alignment between teacher and student vocabularies, albeit in an offline setting.

<Image
  src={uldDiagram}
  layout="fixed"
  zoomable
  downloadable
  loading="lazy"
  alt="ULD Diagram"
  caption={'Figure 1: Previous work, ULD by Boizard et al. demonstrates offline distillation on student and teacher models with unmatched tokenizers. GOLD extends their method to the on-policy setting and addresses two weaknesses: token alignment in step 3 and logit alignment in step 4.'}
/>

ULD showed that distillation between models with different tokenizers introduces two key challenges:

1. Sequence misalignment: tokenizers split text differently. As shown in Figure 2, Tokenizer A might create a single "Hugging Face" token, while Tokenizer B creates two separate tokens.
2. Vocabulary misalignment: the same token string receives different IDs. In Figure 2, "awesome!" is ID=2 in Tokenizer A but ID=0 in Tokenizer B.

As shown in the figure below, this token ID mismatch results in different token sequences for the exact same text, where “Hugging Face is awesome!” corresponds to [3, 1, 2] for Tokenizer A and [2, 3, 1, 0] for Tokenizer B. ULD handles these issues by truncating sequences to the minimum length and by sorting and padding the smaller softmax vector to align vocabularies.

<Image
  src={tokenizationDiagram}
  layout="fixed"
  zoomable
  downloadable
  loading="lazy"
  alt="Tokenization Diagram"
  caption={'Figure 2: Diagram of sequence and vocabulary misalignments caused by differences between two tokenizers. Tokenizer A has fewer elements in its vocabulary and different token IDs when compared to tokenizer B. The differences cause the same text ("Hugging Face is awesome!") to be represented by token ID sequences with different lengths and elements.'}
/>
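
You can observe both misalignments directly by tokenizing the same string with two of the tokenizers used later in this post. This is only a quick, illustrative check (any two tokenizers from different families will do):

```python
from transformers import AutoTokenizer

# The same text yields different token splits and different ids under two tokenizers
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

text = "Hugging Face is awesome!"
print(llama.tokenize(text), llama.encode(text, add_special_tokens=False))
print(qwen.tokenize(text), qwen.encode(text, add_special_tokens=False))
# Different sequence lengths (sequence misalignment) and different ids for
# shared tokens (vocabulary misalignment) are exactly what ULD has to handle.
```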

ULD lifts the tokenizer restriction but remains limited to offline setups. Next, we introduce our core contribution, General On-Policy Logit Distillation (GOLD), which extends ULD into the on-policy setting with improved alignment techniques.

## General On-Policy Logit Distillation (GOLD)

While Universal Logit Distillation (ULD) allows training models with different tokenizers, its methods for sequence and vocabulary alignment have limitations. We developed General On-Policy Logit Distillation (GOLD), an algorithm that extends ULD to the on-policy setting and introduces improved sequence and vocabulary alignment techniques.

### Sequence Alignment

The first limitation we address is ULD's sequence alignment, which simply truncates sequences to the minimum tokenized length. This simple approach causes two problems:

1. It leads to information loss at the end of the text.
2. It can misalign tokens, causing the distillation of tokens with different semantic meanings at the same sequence index.

This alignment error worsens as tokenization differences increase because a single mismatch at the start of a sequence can propagate and create a cascading semantic error throughout the text.

Instead of truncating, our method identifies the token merges required to equalise the sequence lengths for both tokenizers. We then merge the probabilities at the corresponding positions by multiplying the marginal distribution by scalar conditional probabilities of the actual continuation tokens.

We perform the token merge through scalar multiplication to leverage the autoregressive nature of LLM sampling. Following the example in Figure 3, we want to merge "Hugging" and " Face" into one token for the sequence in blue. Using the conditional probabilities and the product rule[^f2], we can merge the probabilities and guarantee sequence alignment regardless of tokenizer discrepancies in the sequence dimension.

<Image
  src={sequenceAlignment}
  layout="fixed"
  zoomable
  downloadable
  loading="lazy"
  alt="GOLD-Sequence Alignment"
  caption={'Figure 3: Diagram highlighting the differences between ULD and GOLD in the sequence alignment step. Instead of truncating the sequence at the minimum sequence length, we first determine the merges that result in an aligned sequence length between the two tokenizers. We then calculate the sum of the logprobs for the merged token position to get a unified vector with the token distribution for that position in the sequence.'}
/>
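
Below is a minimal PyTorch sketch of the merge step, assuming the merge groups have already been computed from the two tokenizations (the TRL implementation handles batching and computes the groups itself):

```python
import torch

def merge_logprobs(logprobs: torch.Tensor, token_ids: torch.Tensor,
                   merge_groups: list[list[int]]) -> torch.Tensor:
    """Collapse sub-token positions so both models expose one distribution per aligned step.

    logprobs:     (seq_len, vocab_size) log-probabilities from one model.
    token_ids:    (seq_len,) ids of the tokens that were actually generated.
    merge_groups: positions to merge after aligning the tokenizations, e.g. [[0], [1, 2], [3]].
    Illustrative sketch only.
    """
    merged = []
    for group in merge_groups:
        # Start from the full distribution at the first sub-position of the group ...
        out = logprobs[group[0]].clone()
        # ... and, via the product rule, add the scalar log-probs of the sub-tokens
        # that were actually sampled at the remaining positions of the group
        for pos in group[1:]:
            out = out + logprobs[pos, token_ids[pos]]
        merged.append(out)
    return torch.stack(merged)
```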

Having resolved sequence mismatches through token merging, we now turn to vocabulary alignment, ensuring logits are comparable even when token IDs differ.

### Vocabulary Alignment

Our second extension improves the alignment in the vocabulary dimension by replacing the sorting operation with an operation that leverages a potential one-to-one mapping between the tokenizers. ULD assumes that we cannot map any token between tokenizers, so it performs a sorting operation in the softmax dimension after padding the logits to have the same size. The assumption behind this process is that the softmax distribution is the same, or at least similar, under a different permutation of token IDs.

We find this assumption to be reasonable, but we can exploit tokens that are present in both vocabularies under different IDs to avoid relying on sorting when there’s a direct mapping. For example, we know that “awesome!” is present in both vocabularies in Figure 4, but the token IDs differ. In GOLD’s approach, we find those mappings where the token exists in both vocabularies and apply the GKD loss that assumes the same tokenizer. We fall back to the sorting process from ULD for the items in the vocabulary without a perfect match, so that we still consider those unmatched tokens during learning. GOLD’s loss is then the sum of $\mathcal{L}_{GKD}$ over the tokens with one-to-one mappings and $\mathcal{L}_{ULD}$ over the tokens without a mapping. We allow defining the weights for each term in our TRL implementation, but include defaults that worked well in our experiments.

$$
\mathcal{L}_{GOLD}(x,y) = w_1\mathcal{L}_{GKD} + w_2\mathcal{L}_{ULD}
$$

<Image
  src={vocabAlignment}
  layout="fixed"
  zoomable
  downloadable
  loading="lazy"
  alt="GOLD-Vocabulary Alignment"
  caption={'Figure 4: Diagram highlighting the differences between ULD and GOLD for the vocabulary alignment. GOLD tries to find 1:1 mappings between tokens in both tokenizers and applies the KL divergence loss from the GKD method. We fall back to the ULD process for tokens without a 1:1 mapping. The final loss is a sum of the two terms.'}
/>
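
Here is a minimal sketch of the hybrid loss at a single aligned position, assuming the matched token ids have already been extracted. It is illustrative only: the TRL implementation additionally handles batching, masking, temperatures, and the generalized JSD from GKD rather than a plain forward KL.

```python
import torch
import torch.nn.functional as F

def gold_vocab_loss(student_logits, teacher_logits, student_ids, teacher_ids,
                    w_matched=1.0, w_unmatched=1.0):
    """Hybrid vocabulary-alignment loss for a single aligned position (illustrative sketch).

    student_logits: (student_vocab,) and teacher_logits: (teacher_vocab,) at one position.
    student_ids / teacher_ids: index tensors of the tokens with a 1:1 mapping.
    """
    s_prob = F.softmax(student_logits, dim=-1)
    t_prob = F.softmax(teacher_logits, dim=-1)

    # Matched tokens: the token strings exist in both vocabularies, so their
    # probabilities can be compared directly (GKD-style divergence, here a forward KL)
    matched_loss = F.kl_div(s_prob[student_ids].log(), t_prob[teacher_ids], reduction="sum")

    # Unmatched tokens: fall back to ULD's strategy of sorting the remaining probability
    # mass, padding to a common length, and comparing element-wise
    s_mask = torch.ones_like(s_prob, dtype=torch.bool)
    s_mask[student_ids] = False
    t_mask = torch.ones_like(t_prob, dtype=torch.bool)
    t_mask[teacher_ids] = False
    s_rest = s_prob[s_mask].sort(descending=True).values
    t_rest = t_prob[t_mask].sort(descending=True).values
    size = max(s_rest.numel(), t_rest.numel())
    s_rest = F.pad(s_rest, (0, size - s_rest.numel()))
    t_rest = F.pad(t_rest, (0, size - t_rest.numel()))
    unmatched_loss = (s_rest - t_rest).abs().sum()

    return w_matched * matched_loss + w_unmatched * unmatched_loss
```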

With GOLD’s design clarified, we’ll now examine how we evaluated it in practice, detailing our experimental setup, tasks, and models.

## Experimental Setup

### Task Definition

We used a math game called Countdown [@gandhi2024streamsearchsoslearning], where the objective is to reach a target value using a group of numbers and four arithmetic operations (+, -, *, /). Additionally, the model must provide the answer using a specific format because we set a strict parser that considers the answer wrong if it can’t find the expected format. We only consider the answer as correct if it fulfils all the following conditions:

- Only uses each number once.
- The equation given by the model results in the target.
- The answer is an equation enclosed in the `<answer> </answer>` tags.
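
For illustration, a minimal verifier for these conditions could look like the sketch below. Our actual parser may differ in details, and `eval` is used here only for brevity:

```python
import re

def is_correct(completion: str, numbers: list[int], target: int) -> bool:
    """Check a Countdown answer: format, number usage, and result (illustrative sketch)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return False  # missing <answer></answer> tags -> wrong format
    equation = match.group(1).strip()
    # Only digits, arithmetic operators, spaces, and parentheses are allowed
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return False
    # Each provided number must be used exactly once
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return False
    try:
        return abs(eval(equation) - target) < 1e-6  # the equation must reach the target
    except (SyntaxError, ZeroDivisionError):
        return False
```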

Below is an example of the system and user prompts we pass to the model for the task.

<iframe
  src="https://huggingface.co/datasets/HuggingFaceTB/Countdown-Task-GOLD/embed/viewer/all/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

### Dataset

We sourced all the prompts from the [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) dataset. Our full dataset contains 80k training prompts and 10k testing prompts selected randomly. We then generated responses with the `Qwen/Qwen2.5-7B-Instruct` and `Qwen/Qwen3-4B-Instruct-2507` teacher models, keeping only the prompts for which the teacher produced a correct answer. Our published dataset contains 30.4k prompts with `Qwen/Qwen2.5-7B-Instruct` generations and 27.7k with `Qwen/Qwen3-4B-Instruct-2507` generations. For all on-policy experiments we use the prompts from the 30.4k-prompt training split, since the completions are generated by the student rather than taken from the teacher.

<iframe
  src="https://huggingface.co/datasets/HuggingFaceTB/Countdown-Task-GOLD/embed/viewer/verified_Qwen3-4B-Instruct-2507/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

### Models Used

To test the effects of model size, performance, and tokenizer similarity on KD, we established several student-teacher setups. The teachers were all Qwen models of varying sizes, while the students were from three different families: Qwen, Llama, and Gemma. This created a significant performance gap for distillation: all student models had a baseline Countdown score below 0.08, whereas the teachers' scores ranged from 0.35 to 0.71.

| Model Type | Model ID | Countdown Score |
| --- | --- | --- |
| Student | meta-llama/Llama-3.2-1B-Instruct | 0.016 |
| Student | Qwen/Qwen2.5-1.5B-Instruct | 0.076 |
| Student | google/gemma-3-1b-it | 0.023 |
| Teacher | Qwen/Qwen2.5-7B-Instruct | 0.3555 |
| Teacher | Qwen/Qwen3-4B-Instruct-2507 | 0.7145 |

### Tokenizer Similarity

We hypothesized that GOLD's performance would correlate with vocabulary similarity. To quantify this, we defined a tokenizer similarity metric using the Jaccard index (Intersection over Union, or IoU). In this context, the "intersection" is the count of tokens that can be matched between the two vocabularies, while the "union" is the total count of unique tokens across both.

Tables 1 and 2 below show the difference in tokenizer similarity when we enforce the same token IDs (first table) compared to when we match different token IDs when they correspond to the same token (second table). The `meta-llama/Llama-3.2-1B-Instruct` and `google/gemma-3-1b-it` tokenizers have 0 similarity with all the teachers in the first case, but we increase it to 0.64 and 0.063 in the second case, respectively.

The tables also show that the Qwen2.5 and Qwen3 tokenizers differ by only a few tokens. In fact, the only difference between the two is that Qwen3 uses the same tokenizer as Qwen2.5 with four additional tokens `('<think>', '<tool_response>', '</tool_response>', '</think>')`. Since the Qwen3 tokenizer fully contains the Qwen2.5 tokenizer, we can treat the two as equivalent for our experiments.

**Table 1: Strict Matching** 

| Student Model | Qwen/Qwen2.5-7B-Instruct | Qwen/Qwen2.5-32B-Instruct | Qwen/Qwen3-4B-Instruct-2507 |
| --- | --- | --- | --- |
| meta-llama/Llama-3.2-1B-Instruct | 0 | 0 | 0 |
| google/gemma-3-1b-it | 0 | 0 | 0 |
| Qwen/Qwen2.5-1.5B-Instruct | 1.0 | 1.0 | 0.999974 |

**Table 2: Token Mapping**

| Student Model | Qwen/Qwen2.5-7B-Instruct | Qwen/Qwen2.5-32B-Instruct | Qwen/Qwen3-4B-Instruct-2507 |
| --- | --- | --- | --- |
| meta-llama/Llama-3.2-1B-Instruct | 0.64 | 0.64 | 0.64 |
| google/gemma-3-1b-it | 0.063 | 0.063 | 0.063 |
| Qwen/Qwen2.5-1.5B-Instruct | 1.0 | 1.0 | 0.999974 |
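
The similarity metric itself is straightforward to compute from the tokenizer vocabularies. The sketch below illustrates the idea; the numbers reported above additionally normalize tokenizer-specific artifacts, so exact values may differ slightly:

```python
from transformers import AutoTokenizer

def tokenizer_similarity(student_name: str, teacher_name: str, strict: bool = False) -> float:
    """Jaccard (IoU) similarity between two tokenizer vocabularies (illustrative sketch).

    strict=True only counts a match when the token string *and* its id agree (Table 1);
    strict=False matches on the token string alone (Table 2).
    """
    student_vocab = AutoTokenizer.from_pretrained(student_name).get_vocab()  # token -> id
    teacher_vocab = AutoTokenizer.from_pretrained(teacher_name).get_vocab()
    if strict:
        matched = sum(1 for tok, idx in student_vocab.items() if teacher_vocab.get(tok) == idx)
    else:
        matched = sum(1 for tok in student_vocab if tok in teacher_vocab)
    union = len(set(student_vocab) | set(teacher_vocab))
    return matched / union

# e.g. tokenizer_similarity("meta-llama/Llama-3.2-1B-Instruct", "Qwen/Qwen2.5-7B-Instruct")
```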

## Experiments

### GKD with the Same Tokenizer

Our first goal was to validate our GKD implementation by comparing our results with those reported by Agarwal et al. [@agarwal2024onpolicydistillationlanguagemodels]. We focused on comparing the performance of combining on-policy and off-policy learning through ablations of five different $\lambda$ values, as shown in Figure 5. We used `Qwen/Qwen3-4B-Instruct-2507` as a teacher and `Qwen/Qwen2.5-1.5B-Instruct` as a student. For the offline learning, we generated completions to the prompts using `Qwen/Qwen3-4B-Instruct-2507` beforehand to speed up the training process. We set the temperature $\gamma=1$ for the student generations and used the forward KL divergence ($\beta=0$)[^f3] in $\mathcal{L}_{OD}$.
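
A minimal sketch of the corresponding TRL setup is shown below. The exact script we ran includes more arguments, the dataset config name is taken from the viewer above, and `GKDConfig`/`GKDTrainer` argument names may vary slightly across TRL versions; we also assume the dataset is already in the conversational format the trainer expects.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

student_name = "Qwen/Qwen2.5-1.5B-Instruct"
teacher_name = "Qwen/Qwen3-4B-Instruct-2507"

model = AutoModelForCausalLM.from_pretrained(student_name)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_name)
tokenizer = AutoTokenizer.from_pretrained(student_name)

# Prompts (and teacher completions for the off-policy fraction) from the Countdown dataset
dataset = load_dataset("HuggingFaceTB/Countdown-Task-GOLD", "verified_Qwen3-4B-Instruct-2507")

training_args = GKDConfig(
    output_dir="Qwen2.5-1.5B-GKD-Countdown",
    lmbda=1.0,        # fraction of on-policy (student-generated) batches
    beta=0.0,         # forward KL divergence
    temperature=1.0,  # sampling temperature for student generations
)

trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset["train"],
)
trainer.train()
```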

The results confirm that using at least some degree of on-policy training outperforms the SFT setup. We also see a trend of better performance as we increase $\lambda$, with fully on-policy achieving the best overall performance. This behavior confirms the hypothesis that fully on-policy training is better than training with offline data when using models with the same tokenizer.

---
<HtmlEmbed id="d3-lambda-ablations" src="d3-lambda-ablations.html" desc="Figure 5: Ablation of the lambda parameter, that controls the blend of the on-policy loss (lambda=1.0) and supervised loss (lambda=0.0)." />
---

### Distilled teacher knowledge

After testing multiple configurations, we achieved a setup that consistently distilled over 80% of a teacher's performance on the Countdown task. This high distillation ratio held true across multiple teacher models of different sizes (as shown in Figure 6), validating our on-policy GKD implementation.

These results underscore a fundamental point: a student's performance is effectively capped by the teacher's capabilities. This highlights the importance of selecting a strong teacher model to maximize student performance.

---
<HtmlEmbed id="d3-jsd-different-teachers" src="d3-jsd-different-teachers.html" desc="Figure 6: Distillation is stable at different model scales, with Qwen/Qwen2.5-1.5B-Instruct as the student and either Qwen/Qwen2.5-7B-Instruct or Qwen/Qwen3-4B-Instruct-2507 as the teacher. In both cases we are able to recover over 80% of the teacher’s performance, which points to the importance of choosing a strong teacher to achieve the best results in KD tasks." />
---

These results validate our GKD implementation. The next question is: can on-policy distillation still succeed when teacher and student use *different tokenizers*?

## On-Policy distillation works with different tokenizers

While our GKD implementation recovered over 80% of the teacher's performance, it was limited to teacher-student pairs with matching tokenizers. Our next experiments addressed this limitation by testing distillation across different model families, which use different tokenizers.

This scenario requires methods that can handle vocabulary and sequence misalignments. We therefore compared the baseline ULD method with our proposed GOLD method to evaluate their effectiveness.

### Tokenizer similarity impacts performance

Tokenizer similarity dictates the extent to which sequence and vocabulary alignment are required. We hypothesized that lower similarity would correlate with lower task performance, and our results confirm this: GOLD's performance on the Countdown task declines as tokenizer similarity decreases.

This decline is an expected trade-off, as the alignment process for divergent vocabularies inevitably introduces some noise. However, even with this effect, we will show that GOLD (at 0.64 similarity) still outperforms RL methods.

| Model | Performance on Countdown | Similarity with Qwen3-4B-Instruct-2507 |
| --- | --- | --- |
| Qwen/Qwen2.5-1.5B-Instruct | 0.6515 | 0.999974 |
| meta-llama/Llama-3.2-1B-Instruct | 0.4235 | 0.64 |
| google/gemma-3-1b-it | 0.0305 | 0.063 |

### GOLD outperforms ULD

We tested our extensions by training `meta-llama/Llama-3.2-1B-Instruct` (student) with `Qwen/Qwen3-4B-Instruct-2507` (teacher). The results in Figure 7 show a substantial performance difference between the methods:

- GOLD improved the student's initial performance by 25% and recovered 60% of the teacher's performance.
- ULD improved the student by only 5% and recovered just 10% of the teacher's performance.

This difference is attributable to GOLD's improved alignment techniques. This specific student-teacher pair had 0 similarity under a strict ID match, but our token content matching (from Figure 4) increased this to 0.64. This, combined with our improved sequence alignment (from Figure 3), enabled effective knowledge transfer where ULD failed and produced results competitive with RL methods.

---
<HtmlEmbed id="d3-uld-ablations" src="d3-uld-ablations.html" desc="Figure 7: GOLD performs better than ULD when distilling Qwen/Qwen3-4B-Instruct-2507 into meta-llama/Llama-3.2-1B-Instruct. The plot also shows the long warmup in both cases because the model performance has a noticeable improvement only after the 1000th step." />
---

Having shown that GOLD handles tokenizer differences effectively, we now benchmark it against an RL algorithm, GRPO, to test its efficiency and performance.

## On-policy distillation outperforms GRPO

On-policy distillation uses student-generated completions to progressively update the training data. Having established this approach is superior to offline methods like SFT (when tokenizers match), we next compared it to other on-policy methods, specifically Group Relative Policy Optimization (GRPO). GRPO is an RL method introduced in the DeepSeek-Math paper [@shao2024deepseekmathpushinglimitsmathematical] and later popularized by the Deepseek R1 release [@deepseekai2025deepseekr1incentivizingreasoningcapability].

We followed [Philipp Schmid’s tutorial](https://www.philschmid.de/mini-deepseek-r1) on training GRPO for the Countdown task and compared it to the performance of on-policy distillation. Our reward function was a sum of three components:

1. **Format:** +1 if the response includes the `<answer></answer>` tags correctly.
2. **Following Rules:** +1 if the model followed the rule of using the numbers provided in the prompt and only using each number once.
3. **Correct Equation:** +1 if the equation is correct.

The implementation in the tutorial joined the Format and Following Rules reward into a single function, but we found that the results were better when splitting the conditions into two separate reward functions.
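
Below is a sketch of what the split reward functions could look like, assuming plain-text completions and a dataset with `numbers` and `target` columns (TRL's `GRPOTrainer` passes extra dataset columns to reward functions as keyword arguments). This is illustrative rather than our exact implementation:

```python
import re

def format_reward(completions, **kwargs):
    """+1 when the completion contains a well-formed <answer>...</answer> block."""
    return [1.0 if re.search(r"<answer>.*?</answer>", c, re.DOTALL) else 0.0 for c in completions]

def rules_reward(completions, numbers, **kwargs):
    """+1 when the equation uses exactly the provided numbers, each one once."""
    rewards = []
    for completion, nums in zip(completions, numbers):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        used = [int(n) for n in re.findall(r"\d+", match.group(1))] if match else []
        rewards.append(1.0 if used and sorted(used) == sorted(nums) else 0.0)
    return rewards

def equation_reward(completions, numbers, target, **kwargs):
    """+1 when the extracted equation evaluates to the target value."""
    rewards = []
    for completion, tgt in zip(completions, target):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        try:
            rewards.append(1.0 if match and abs(eval(match.group(1)) - tgt) < 1e-6 else 0.0)
        except Exception:
            rewards.append(0.0)
    return rewards

# trainer = GRPOTrainer(model=..., reward_funcs=[format_reward, rules_reward, equation_reward], ...)
```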

Figure 8 shows our results for the scenario with the same tokenizer (above) and different tokenizers (below). For the same-tokenizer scenario, KD outperforms GRPO by 2x! The scenario with different tokenizers has a narrower performance gap between KD and GRPO, but GOLD still performs 20% better than GRPO. These results align with the [Qwen3 Technical Report](https://huggingface.co/papers/2505.09388), where on-policy distillation performs similarly to or better than RL. However, our results go one step further: we perform better than RL using a student-teacher pairing from different model families with different tokenizers.

---
<HtmlEmbed id="d3-grpo-comparison" src="d3-grpo-comparison.html" desc="Figure 8: GKD and GOLD perform better than GRPO when training meta-llama/Llama-3.2-1B-Instruct. The gains from distillation are clearer in the GKD case because we can distill the teacher better, but we still perform better than GRPO with our GOLD approach." />
---

Beyond mathematical reasoning, on-policy distillation also applies to *domain-specific fine-tuning*. Let’s explore how the same ideas improve personalization and task adaptation.

## Distillation for Domain Adaptation

In the [Thinking Machines blog post](https://thinkingmachines.ai/blog/on-policy-distillation/#distillation-for-personalization), the authors distilled a language model for personalisation. They improved a [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) model on an internal domain dataset and evaluation benchmark, and then restored its ability on IFEval, the instruction-following benchmark. This is useful because models often lose their instruction-following abilities during domain-specific fine-tuning with SFT. Thinking Machines achieved this by interleaving phases of continued pre-training on domain-specific data (mid-training) and on-policy distillation with a high-quality chat dataset, [`allenai/tulu-3-sft-mixture`](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture). As the table below shows, chat performance is restored following on-policy distillation.

<Image
  src={domainDistillationImage}
  layout="fixed"
  zoomable
  loading="lazy"
  alt="Domain Distillation Results"
/>


To make these results reproducible, we’ll now walk through how to implement the full process using open-source datasets and the TRL framework.

### Reproducing in TRL

We can reproduce the above process in TRL and share the implementation using open models and datasets! 

We’ve made some adaptations to the Thinking Machines experiment to use datasets and benchmarks that are publicly available, instead of their “internal document dataset and benchmark”:

- The [`open-r1/codeforces`](https://huggingface.co/datasets/open-r1/codeforces) dataset as a domain specific dataset.
- The `livecodebench` evaluation benchmark to align with the Codeforces competitive coding task above.
- The same [`allenai/tulu-3-sft-mixture`](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) dataset and the IFEval benchmark.
- The `Qwen/Qwen3-4B-Instruct-2507` model.

### Supervised Fine-Tuning on [`open-r1/codeforces`](https://huggingface.co/datasets/open-r1/codeforces)

We fine-tuned the `Qwen/Qwen3-4B-Instruct-2507` model on [`open-r1/codeforces`](https://huggingface.co/datasets/open-r1/codeforces) with [`SFTTrainer`](https://huggingface.co/docs/trl/en/sft_trainer), which improved performance from ***35.1%*** to ***40.3%*** on livecodebench. However, the model’s IFEval score fell from ***83.4%*** to ***79.5%***, which is common in domain-specific fine-tuning.

Note that we trained this model for 1k steps and stopped early. For a more complete study of fine-tuning for advanced reasoning tasks, check out this [blog post from the Open R1 project](https://huggingface.co/blog/open-r1/update-3).

### Generalized Knowledge Distillation on [`allenai/tulu-3-sft-mixture`](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)

Starting from the SFT checkpoint above, we used the [`GKDTrainer`](https://huggingface.co/docs/trl/en/gkd_trainer) with the [`allenai/tulu-3-sft-mixture`](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) dataset, which improved performance on IFEval from ***79.5%*** to ***82.8%*** whilst approximately maintaining the livecodebench score at ***39.8%***.

<HtmlEmbed frameless id="domain-distillation" src="domain-distillation.html" desc="Results from first finetuning on Codeforces data to improve LCB and then recovering performance on IFEval by distilling the initial Qwen3-4B model." />

### Building it for yourself

If you want to try out knowledge distillation on your own use case, or on a dataset from the Hub, the recipes are available below.


<Accordion title="SFT Recipe">
```bash
accelerate launch \
  --config_file examples/accelerate_configs/multi_gpu.yaml trl/scripts/sft.py \
  --model_name_or_path Qwen/Qwen3-4B-Instruct-2507 \
  --dtype auto \
  --attn_implementation kernels-community/flash-attn \
  --dataset_name open-r1/codeforces-cots \
  --dataset_config solutions_decontaminated \
  --bf16 \
  --gradient_checkpointing \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 32 \
  --learning_rate 1e-5 \
  --num_train_epochs 1 \
  --max_length 16384 \
  --logging_steps 1 \
  --report_to trackio \
  --trackio_project Qwen3-4B-SFT-Codeforces \
  --output_dir data/Qwen3-4B-SFT-Codeforces \
  --push_to_hub \
  --hub_model_id <your-username>/Qwen3-4B-SFT-Codeforces \
  --seed 42 \
  --warmup_ratio 0.05 \
  --lr_scheduler_type cosine_with_min_lr \
  --use_liger_kernel
```
</Accordion>


<Accordion title="Distillation Recipe">
```bash
accelerate launch \
  --config_file examples/accelerate_configs/multi_gpu.yaml trl/experimental/gold/gold.py \
  --model_name_or_path <sft-model> \
  --dtype auto \
  --attn_implementation kernels-community/flash-attn \
  --dataset_name allenai/tulu-3-sft-mixture \
  --dataset_train_split train \
  --bf16 \
  --learning_rate 1e-7 \
  --gradient_checkpointing \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 64 \
  --num_train_epochs 1 \
  --eval_strategy steps \
  --eval_steps 100 \
  --temperature 1.0 \
  --top_p 0.95 \
  --top_k 0 \
  --max_completion_length 2048 \
  --max_length 2560 \
  --lmbda 0.25 \
  --beta 0.0 \
  --use_uld_loss \
  --use_extended_uld \
  --uld_use_hybrid_loss \
  --uld_crossentropy_weight 0.0 \
  --uld_distillation_weight 1.0 \
  --uld_student_temperature 1.0 \
  --uld_teacher_temperature 1.0 \
  --uld_hybrid_unmatched_weight 1.0 \
  --uld_hybrid_matched_weight 1.0 \
  --teacher_model_name_or_path Qwen/Qwen3-4B-Instruct-2507 \
  --logging_steps 1 \
  --push_to_hub \
  --hub_model_id <your-username>/Qwen3-4B-GKD-Tulu \
  --report_to trackio \
  --trackio_project Qwen3-4B-GKD-Tulu \
  --seed 42 \
  --warmup_ratio 0.05 \
  --lr_scheduler_type cosine_with_min_lr
```
</Accordion>

## Conclusion

In this post, we introduced General On-Policy Logit Distillation (GOLD), a new method that enables effective on-policy knowledge distillation between models, even when the teacher and student do not share the same tokenizer vocabulary. This overcomes a significant limitation of existing on-policy methods like GKD, which require matched tokenizers.

GOLD builds upon the offline ULD method but extends it to the on-policy setting and, critically, addresses its two main weaknesses. First, we replace ULD's naive sequence truncation with a token-merging strategy that multiplies marginal distributions by scalar conditional probabilities. Second, we implement a hybrid vocabulary alignment method that uses a direct-mapping loss for shared tokens and falls back to ULD's sorting method only for unmatched tokens.

Our experiments on the Countdown math task confirm GOLD's advantages. We showed that GOLD significantly outperforms the original offline ULD implementation, recovering 60% of the teacher's performance versus ULD's 10%. Furthermore, GOLD proved superior to other on-policy methods, outperforming a supervised fine-tuning baseline by 15% and a GRPO baseline by 2x. Even in the difficult cross-tokenizer scenario, GOLD still outperformed GRPO by 20%.

These findings demonstrate that GOLD is a powerful and flexible technique for model distillation. It provides a path to distill knowledge from any high-performing teacher to any student, regardless of their tokenizer, offering a more effective and token-efficient alternative to reinforcement learning.

[^f1]: The full GKD loss is then formally defined as: $$\mathcal{L}_{GKD} := (1-\lambda) \mathbb{E}_{(x,y)\sim (X,Y)}[\mathcal{D}_{JSD(\beta)}] + \lambda  \mathbb{E}_{x \sim X}[\mathbb{E}_{y \sim p_{S}(.|x)}[\mathcal{D}_{JSD(\beta)}]].$$

[^f2]: Details of why we can merge the probabilities using the chain rule: for the merged distribution at position $i$, $$P_{\text{merged}}(y) = P(y \mid x) \times P(\text{token}_1 \mid x) \times P(\text{token}_2 \mid \text{token}_1, x) \times \dots$$ This correctly computes the joint probability of the actual generated sequence while providing a reasonable approximation for counterfactual tokens.

[^f3]: The $\beta$ parameter then controls the generalized Jensen-Shannon divergence between the student (S) and teacher (T) distributions, calculated via the following loss summed over the sequence and averaged over the batch: $$\mathcal{D}_{\text{JSD}(\beta)}(p_S, p_T) = \beta \cdot D_{\text{KL}}(p_S \| \pi) + (1-\beta) \cdot D_{\text{KL}}(p_T \| \pi)$$ where $\pi = \beta \cdot p_S + (1-\beta) \cdot p_T$.