arxiv:2607.01763

Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Published on Jul 2

Abstract

On-policy self-distillation in continual post-training accelerates in-domain specialization but fails to prevent forgetting and can collapse in out-of-distribution scenarios, indicating that on-policy data alone is insufficient for continual learning.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher-student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at https://github.com/Moenupa/SDPO-CL.
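To make the "dense versus sparse" contrast in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a GRPO-style signal is one group-normalized scalar per sampled response, and it stands in for a dense self-distillation signal with a per-token KL divergence between student and teacher distributions. The function names, the KL direction, and the toy numbers are all illustrative assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Sparse signal: one scalar per response, normalized within its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_kl(student_probs, teacher_probs):
    """Dense signal: one KL(student || teacher) value per token position.

    The KL direction is an assumption made for illustration only.
    """
    return [
        sum(p * math.log(p / q) for p, q in zip(s, t) if p > 0)
        for s, t in zip(student_probs, teacher_probs)
    ]


# Toy group of four sampled responses with binary rewards (hypothetical values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Toy three-token response over a two-word vocabulary (hypothetical values).
student = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
teacher = [[0.25, 0.75], [0.9, 0.1], [0.6, 0.4]]
dense_signal = per_token_kl(student, teacher)

print(len(advantages), "scalars for 4 responses")
print(len(dense_signal), "values for 1 response of 3 tokens")
```

In this sketch every token position contributes its own training signal in the dense case, so a biased teacher distribution is felt at every position rather than once per response. This is one plausible reading of why the abstract reports larger drift under denser supervision; the paper's analyses remain the authority on the actual mechanism.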

Community

Moenupa

Paper submitter about 7 hours ago

Denser supervision ≠ better performance: we found that SDPO suffers from forgetting much more than GRPO does.
