
GFPO

This feature implements GFPO (Group Filtered Policy Optimization), an algorithm that encourages concise reasoning in the model's generated outputs, as proposed in the paper Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.

Usage

To activate GFPO in [GFPOTrainer]:

  • Set num_remains_in_group in [GFPOConfig].
  • Define a group filter function and pass it as group_filter_func to [GFPOTrainer]. group_filter_func scores the num_generations completions in each group; the GFPOTrainer then keeps the top num_remains_in_group completions by score as a new group, and the model is trained on that filtered group.
# train_gfpo.py
from trl.experimental.gfpo import GFPOConfig, GFPOTrainer

# Dummy group filter that scores each completion by its index within the group
class GroupFilter:
    def __call__(self, group_completions, group_rewards, **kwargs):
        group_scores = []
        for completions, rewards in zip(group_completions, group_rewards):
            scores = [float(i) for i in range(len(completions))]
            group_scores.append(scores)
        return group_scores

training_args = GFPOConfig(
    output_dir="Qwen3-0.6B-GFPO",
    per_device_train_batch_size=4,
    num_remains_in_group=2,
    bf16=True,
)
trainer = GFPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=...,
    train_dataset=...,
    args=training_args,
    group_filter_func=GroupFilter(),
)
trainer.train()
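
Since the stated goal is concise reasoning, a more realistic filter than the dummy one above might prefer shorter completions. The following is a minimal sketch, not the library's built-in behavior: ShortestCompletionFilter and keep_top_k are hypothetical names, the call signature is copied from the dummy filter, and it assumes that len() of a completion is a meaningful length (for example a list of token ids). keep_top_k only imitates the "keep the highest-scoring completions" step described above so the effect can be checked without running a trainer.

```python
# Hypothetical length-based group filter (illustrative, not part of TRL).
class ShortestCompletionFilter:
    """Score completions by negative length so shorter ones rank higher."""

    def __call__(self, group_completions, group_rewards, **kwargs):
        # group_rewards is accepted to match the expected signature but unused here.
        group_scores = []
        for completions in group_completions:
            # Assumption: len(c) measures completion length
            # (token count for token-id lists, characters for strings).
            group_scores.append([-float(len(c)) for c in completions])
        return group_scores


def keep_top_k(scores, k):
    """Illustrative stand-in for the selection step: indices of the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# One group of four completions, given as toy token-id lists.
group_completions = [[[1, 2, 3, 4, 5], [1, 2], [1, 2, 3], [1]]]
group_rewards = [[1.0, 0.0, 1.0, 0.0]]

scores = ShortestCompletionFilter()(group_completions, group_rewards)
kept = keep_top_k(scores[0], k=2)  # indices of the two shortest completions
```

With this design, a higher score means "more worth keeping", so negating the length makes the shortest completions survive the filter. A filter like this could be passed as group_filter_func=ShortestCompletionFilter() in place of GroupFilter() above. Note that it ignores rewards entirely; a filter that also weighs group_rewards would be a natural variation.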

GFPOTrainer

[[autodoc]] experimental.gfpo.GFPOTrainer
    - train
    - save_model
    - push_to_hub

GFPOConfig

[[autodoc]] experimental.gfpo.GFPOConfig