arxiv:2607.01763

Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

Published on Jul 2 · Submitted by Meng Wang on Jul 3

Abstract

On-policy self-distillation in continual post-training accelerates in-domain specialization but fails to prevent forgetting and can collapse in out-of-distribution scenarios, indicating that on-policy data alone is insufficient for continual learning.

Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher-student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at https://github.com/Moenupa/SDPO-CL.
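To make "denser" concrete, the following is a minimal sketch of the two kinds of training signal the abstract contrasts. It is not taken from the paper or its repository. It assumes GRPO's usual group-standardized reward (one scalar per sampled response, population standard deviation) and, for the dense side, a per-token KL divergence from student to teacher. The KL direction and the function names `group_relative_advantages` and `per_token_kl` are illustrative assumptions, and SDPO's actual objective may differ.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Sparse signal: one scalar per response, standardized within its group.

    Assumption: population standard deviation, as in common GRPO write-ups.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_kl(student_logits: List[List[float]],
                 teacher_logits: List[List[float]]) -> List[float]:
    """Dense signal: one KL(student || teacher) value per token position.

    Assumption: reverse KL is used purely for illustration.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student log-probabilities
        log_q = log_softmax(t_row)  # teacher log-probabilities
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls
```

In this sketch, a group of four responses yields four scalars under the group-relative scheme, while a single response of T tokens yields T separate values under the per-token scheme. That difference in the amount of supervision per sample is what "density" refers to here.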

Community

Denser Supervision ≠ Better Performance: we found that SDPO suffers from forgetting much more than GRPO.
