# GSPO-token

In the paper [Group Sequence Policy Optimization](https://huggingface.co/papers/2507.18071), the authors propose a token-level variant of the GSPO objective, called GSPO-token. To use GSPO-token, use the `GRPOTrainer` class from `trl.experimental.gspo_token`.
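
For reference, the GSPO-token objective can be written as follows. This is a transcription of the paper's formulation rather than a description of the TRL code. Here \\( G \\) is the group size, \\( y_i \\) is the \\( i \\)-th sampled response to the query \\( x \\), \\( \varepsilon \\) is the clipping range, and \\( \operatorname{sg}[\cdot] \\) denotes stop-gradient.

$$
\mathcal{J}_{\text{GSPO-token}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left( s_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\left( s_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_{i,t} \right) \right]
$$

$$
s_{i,t}(\theta) = \operatorname{sg}\left[ s_i(\theta) \right] \cdot \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\operatorname{sg}\left[ \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]}, \qquad s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}
$$

Numerically, \\( s_{i,t}(\theta) \\) always equals the sequence-level ratio \\( s_i(\theta) \\); the two differ only in how gradients flow, which is what allows a separate advantage per token.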
## Usage

```python
from trl import GRPOConfig
from trl.experimental.gspo_token import GRPOTrainer

# "sequence_token" selects the GSPO-token importance-sampling ratio.
training_args = GRPOConfig(
    importance_sampling_level="sequence_token",
    ...
)
```
> [!WARNING]
> To leverage GSPO-token, you need to provide a per-token advantage \\( \hat{A}_{i,t} \\) for each token \\( t \\) in sequence \\( i \\), i.e. make \\( \hat{A}_{i,t} \\) vary with \\( t \\). This is not the case here, where \\( \hat{A}_{i,t} = \hat{A}_{i} \\) for every token. With a constant advantage, the GSPO-token gradient is equivalent to that of the original GSPO implementation.
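
The equivalence can be checked numerically with a small standalone sketch. It does not use TRL: the function names are hypothetical, clipping is omitted for brevity, and stop-gradient is emulated by passing a frozen copy of the log-probabilities that is held fixed while the gradient is taken by finite differences.

```python
import math

def gspo_surrogate(logps, old_logps, advantage):
    """Unclipped GSPO term for one sequence: s_i * A_i."""
    s_i = math.exp(sum(n - o for n, o in zip(logps, old_logps)) / len(logps))
    return s_i * advantage

def gspo_token_surrogate(logps, old_logps, advantages, frozen_logps):
    """Unclipped GSPO-token term: mean over t of s_{i,t} * A_{i,t}.

    frozen_logps emulates stop-gradient: it holds the same values as logps
    but is not perturbed when the gradient is computed.
    """
    n = len(logps)
    s_i = math.exp(sum(f - o for f, o in zip(frozen_logps, old_logps)) / n)  # sg[s_i]
    terms = (s_i * math.exp(lp - f) * a
             for lp, f, a in zip(logps, frozen_logps, advantages))
    return sum(terms) / n

def grad(fn, logps, eps=1e-6):
    """Central finite-difference gradient of fn with respect to logps."""
    out = []
    for t in range(len(logps)):
        up, down = list(logps), list(logps)
        up[t] += eps
        down[t] -= eps
        out.append((fn(up) - fn(down)) / (2 * eps))
    return out

# Toy per-token log-probabilities under the current and old policies.
logps = [-1.0, -0.5, -2.0]
old_logps = [-1.1, -0.7, -1.8]

g_seq = grad(lambda lp: gspo_surrogate(lp, old_logps, 0.8), logps)
g_tok_const = grad(
    lambda lp: gspo_token_surrogate(lp, old_logps, [0.8, 0.8, 0.8], logps), logps
)
g_tok_vary = grad(
    lambda lp: gspo_token_surrogate(lp, old_logps, [0.8, -0.3, 1.5], logps), logps
)

# Constant advantage: same gradient as GSPO. Varying advantage: different.
same = all(abs(a - b) < 1e-6 for a, b in zip(g_seq, g_tok_const))
differs = any(abs(a - b) > 1e-3 for a, b in zip(g_seq, g_tok_vary))
print(same, differs)
```

In this toy setting, each token's gradient is the frozen sequence ratio times that token's advantage divided by the sequence length, so the two objectives only diverge once the advantages differ across tokens.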
## GRPOTrainer

[[autodoc]] experimental.gspo_token.GRPOTrainer
    - train
    - save_model
    - push_to_hub