
GSPO-token

In the paper Group Sequence Policy Optimization, the authors propose a token-level variant of the GSPO objective, called GSPO-token. To use GSPO-token, use the GRPOTrainer class from trl.experimental.gspo_token.
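For reference, the GSPO-token objective can be written as below. This is a restatement based on our reading of the paper, not something taken from the trainer code; here $\mathrm{sg}[\cdot]$ denotes stop-gradient, $G$ the group size, $|y_i|$ the length of response $y_i$, and $\varepsilon$ the clipping range.

```latex
$$
\mathcal{J}_{\text{GSPO-token}}(\theta)
  = \mathbb{E}\left[
      \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \min\!\left(
        s_{i,t}(\theta)\,\hat{A}_{i,t},\;
        \operatorname{clip}\!\left(s_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_{i,t}
      \right)
    \right]
$$

$$
s_{i,t}(\theta)
  = \mathrm{sg}\!\left[s_i(\theta)\right] \cdot
    \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
         {\mathrm{sg}\!\left[\pi_\theta(y_{i,t} \mid x, y_{i,<t})\right]},
\qquad
s_i(\theta)
  = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|}
$$
```

Numerically, $s_{i,t}(\theta)$ equals the sequence-level ratio $s_i(\theta)$; only the gradient path differs, since it flows through the per-token term.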

Usage

from trl.experimental.gspo_token import GRPOTrainer
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence_token",
    ...
)

To leverage GSPO-token, you need to provide a per-token advantage $\hat{A}_{i,t}$ for each token $t$ in sequence $i$, i.e., make $\hat{A}_{i,t}$ vary with $t$. This isn't the case here, where $\hat{A}_{i,t} = \hat{A}_{i}$. Otherwise, the GSPO-token gradient is equivalent to that of the original GSPO implementation.
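The equivalence claim above can be checked on toy tensors. The following is a minimal PyTorch sketch, not the trainer's actual code: the helper names (`sequence_weights`, `sequence_token_weights`, `unclipped_loss`) are hypothetical, clipping is omitted for brevity, and the log-probabilities are treated directly as the trainable parameters.

```python
import torch

def sequence_weights(logps, old_logps, mask):
    """GSPO: length-normalized sequence-level ratio s_i, shape (G, 1)."""
    log_ratio = logps - old_logps
    seq_log_w = (log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)
    return torch.exp(seq_log_w).unsqueeze(-1)

def sequence_token_weights(logps, old_logps, mask):
    """GSPO-token: s_{i,t} = sg[s_i] * pi / sg[pi], shape (G, T)."""
    log_ratio = logps - old_logps
    seq_log_w = (log_ratio * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)
    # Stop the gradient through s_i, then re-attach a per-token gradient path.
    seq_log_w = seq_log_w.detach().unsqueeze(-1)
    return torch.exp(logps - logps.detach() + seq_log_w)

def unclipped_loss(weights, advantages, mask):
    """Negative objective without clipping: token mean per sequence, then group mean."""
    per_token = -(weights * advantages) * mask
    return (per_token.sum(-1) / mask.sum(-1).clamp(min=1.0)).mean()

torch.manual_seed(0)
G, T = 3, 4  # toy group size and maximum completion length
old_logps = -torch.rand(G, T)
logps = (old_logps + 0.1 * torch.randn(G, T)).detach().requires_grad_(True)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0]], dtype=torch.float32)

def grad_wrt_logps(weight_fn, advantages):
    loss = unclipped_loss(weight_fn(logps, old_logps, mask), advantages, mask)
    return torch.autograd.grad(loss, logps)[0]

# Case 1: the advantage is constant over t, i.e. A_{i,t} = A_i.
seq_adv = torch.tensor([1.0, -0.5, 0.25]).unsqueeze(-1).expand(G, T)
g_gspo = grad_wrt_logps(sequence_weights, seq_adv)
g_token_same = grad_wrt_logps(sequence_token_weights, seq_adv)

# Case 2: the advantage varies with t (an arbitrary illustrative scaling).
tok_adv = seq_adv * torch.linspace(0.5, 1.5, T)
g_token_diff = grad_wrt_logps(sequence_token_weights, tok_adv)

same_when_constant = torch.allclose(g_gspo, g_token_same, atol=1e-6)
differs_when_varying = not torch.allclose(g_gspo, g_token_diff, atol=1e-6)
```

Under these assumptions, `same_when_constant` should hold because both gradients reduce to $-\frac{1}{G}\frac{1}{|y_i|} s_i \hat{A}_i$ per unmasked token, while `differs_when_varying` shows that per-token advantages are what make GSPO-token behave differently from GSPO.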

GRPOTrainer

[[autodoc]] experimental.gspo_token.GRPOTrainer
    - train
    - save_model
    - push_to_hub