Papers
arxiv:2602.01058

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Published on Feb 1
· Submitted by Dylan on Feb 3
Abstract

Post-training of reasoning large language models can be improved by correcting distribution mismatches between supervised fine-tuning and reinforcement learning stages through importance sampling reweighting of the SFT loss.

AI-generated summary

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks with Qwen 2.5, Qwen 3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training: designing and evaluating SFT with downstream RL in mind rather than in isolation.

Community

Paper submitter

A good objective for supervised post-training is commonly taken to be one that maximizes performance after the supervised stage. But when this supervised stage is followed by an online RL stage, SFT-stage gains may not be preserved after online RL. This paper experiments with a variety of supervised objectives and finds that the out-of-the-box ranking of these objectives often changes after subsequent RL.

This highlights a mismatch between these two goals. The paper proposes a reweighting mechanism for standard supervised losses, designed to weigh each token by the effect that learning on that token has on the RL stage. It presents an approach inspired by off-policy evaluation to compute weights based on the likelihood of continuation from each starting point, includes multiple practical variants based on that principle, and demonstrates their effectiveness.
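As a rough illustration of the token-level variant described in the abstract, the sketch below reweights a standard SFT negative log-likelihood with per-token importance ratios between the policy being trained and the offline data-generating distribution. This is a minimal sketch, not the paper's exact PEAR objective: the function name, clipping scheme, and the assumption that reference log-probabilities were collected once beforehand are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def is_reweighted_sft_loss(policy_logits: torch.Tensor,
                           ref_logprobs: torch.Tensor,
                           target_ids: torch.Tensor,
                           clip: float = 5.0) -> torch.Tensor:
    """Token-level importance-sampling reweighted SFT loss (illustrative).

    policy_logits: (T, V) logits of the model being fine-tuned
    ref_logprobs:  (T,) log-probs of the target tokens under the offline
                   data-generating distribution, precomputed once
    target_ids:    (T,) target token ids from the offline SFT data
    """
    logprobs = F.log_softmax(policy_logits, dim=-1)
    tok_logp = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (T,)
    # Importance weight pi_theta(y_t) / q(y_t), detached so it scales the
    # gradient rather than being differentiated through; log-ratio clipping
    # keeps the weights numerically bounded.
    log_ratio = (tok_logp.detach() - ref_logprobs).clamp(-clip, clip)
    weights = log_ratio.exp()
    # Weighted negative log-likelihood over the sequence.
    return -(weights * tok_logp).mean()
```

When the offline distribution matches the current policy, every weight is 1 and the loss reduces to the ordinary SFT cross-entropy; tokens the current policy finds unlikely relative to the data source are down-weighted.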


