arxiv:2603.25562

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Published on Mar 26

· Submitted by

Yuqian Fu on Mar 27

Institute of Automation,chinese academy of science

Upvote

Authors:

Yuqian Fu ,

Abstract

On-policy distillation for large language models faces challenges in long-horizon settings due to token-level signal fragility, which is addressed through improved estimation methods and implementation techniques.

AI-generated summary

On-policy distillation (OPD) is appealing for large language model (LLM) post-training because it evaluates teacher feedback on student-generated rollouts rather than fixed teacher traces. In long-horizon settings, however, the common sampled-token variant is fragile: it reduces distribution matching to a one-token signal and becomes increasingly unreliable as rollouts drift away from prefixes the teacher commonly visits. We revisit OPD from the estimator and implementation sides. Theoretically, token-level OPD is biased relative to sequence-level reverse-KL, but it has a much tighter worst-case variance bound; our toy study shows the same tradeoff empirically, with stronger future-reward coupling producing higher gradient variance and less stable learning. Empirically, we identify three failure modes of sampled-token OPD: an imbalanced one-token signal, unreliable teacher guidance on student-generated prefixes, and distortions caused by tokenizer or special-token mismatch. We address these issues with teacher top-K local support matching, implemented as truncated reverse-KL with top-p rollout sampling and special-token masking. Across single-task math reasoning and multi-task agentic-plus-math training, this objective yields more stable optimization and better downstream performance than sampled-token OPD.

View arXiv page View PDF Project page GitHub 7 Add to collection

Community

Yuqian-Fu

Paper author Paper submitter about 10 hours ago

On-policy distillation (OPD) trains a student on its own rollouts using teacher feedback[1][2][3]. In long-horizon LLM post-training, the common sampled-token implementation can be brittle.
From a bias-variance perspective, token-level OPD is biased relative to sequence-level reverse-KL, but it admits a much tighter worst-case variance bound. Our toy study shows that stronger future-reward coupling substantially increases gradient variance and destabilizes optimization.
In practice, brittleness comes from three sources: an imbalanced one-token learning signal, unreliable teacher guidance on student-generated prefixes, and tokenizer/special-token mismatch.
We replace the one-sample comparison with a teacher top-K truncated reverse-KL over local support, together with top-p rollouts and special-token masking. This yields more stable training and better results on both reasoning and agentic multi-task benchmarks.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2603.25562

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.25562 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.25562 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.25562 in a Space README.md to link it from this page.