Buckets:
| # Experimental Features | |
| The `trl.experimental` namespace provides a minimal, clearly separated space for fast iteration on new ideas. | |
| > [!WARNING] | |
| > **Stability contract:** Anything under `trl.experimental` may change or be removed in *any* release (including patch versions) without prior deprecation. Do not rely on these APIs for production workloads. | |
| ## Current Experimental Features | |
| The following modules are currently available under [`trl.experimental`](https://github.com/huggingface/trl/tree/main/trl/experimental). | |
| This list is not exhaustive and may change at any time. | |
| ### BEMA for Reference Model | |
| This feature implements the BEMA algorithm to update the reference model during DPO training. | |
| ```python | |
| from trl.experimental.bema_for_ref_model import BEMACallback, DPOTrainer | |
| from datasets import load_dataset | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| pref_dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train") | |
| ref_model = AutoModelForCausalLM.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5") | |
| bema_callback = BEMACallback(update_ref_model=True) | |
| model = AutoModelForCausalLM.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5") | |
| tokenizer = AutoTokenizer.from_pretrained("trl-internal-testing/tiny-Qwen2ForCausalLM-2.5") | |
| tokenizer.pad_token = tokenizer.eos_token | |
| trainer = DPOTrainer( | |
| model=model, | |
| ref_model=ref_model, | |
| train_dataset=pref_dataset, | |
| processing_class=tokenizer, | |
| callbacks=[bema_callback], | |
| ) | |
| trainer.train() | |
| ``` | |
| ### GFPO | |
| This feature implements the GFPO algorithm to enforce concise reasoning in the model's output generation, as proposed in the paper [Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning](https://huggingface.co/papers/2508.09726). | |
| To activate GFPO in `GFPOTrainer`: | |
| - set `num_remains_in_group` in `GFPOConfig` | |
| - define a group filter function and set it to `group_filter_func` in `GFPOTrainer`. `group_filter_func` will score the `num_generations` completions and The GFPOTrainer filters groups according to their scores to get top `num_remains_in_group` completions as a new group. Model will be trained on the filtered group. | |
| ```python | |
| # train_gfpo.py | |
| from trl.experimental.gfpo import GFPOConfig, GFPOTrainer | |
| # dummy group filter to scores the completions based on its indice in group | |
| class GroupFilter: | |
| def __call__(self, group_completions, group_rewards, **kwargs): | |
| group_scores = [] | |
| for completions, rewards in zip(group_completions, group_rewards): | |
| scores = [float(i) for i in range(len(completions))] | |
| group_scores.append(scores) | |
| return group_scores | |
| training_args = GFPOConfig( | |
| output_dir="Qwen3-0.6B-GFPO", | |
| per_device_train_batch_size=4, | |
| num_remains_in_group=2, | |
| bf16=True, | |
| ) | |
| trainer = GFPOTrainer( | |
| model="Qwen/Qwen3-0.6B", | |
| reward_funcs=..., | |
| train_dataset=..., | |
| args=training_args, | |
| group_filter_func=GroupFilter(), | |
| ) | |
| trainer.train() | |
| ``` | |
| ### GSPO-token | |
| In the paper [Group Sequence Policy Optimization](https://huggingface.co/papers/2507.18071), the authors propose a token-level objective variant to GSPO, called GSPO-token. To use GSPO-token, you can use the `GRPOTrainer` class in `trl.experimental.gspo_token`. | |
| ```python | |
| from trl.experimental.gspo_token import GRPOTrainer | |
| from trl import GRPOConfig | |
| training_args = GRPOConfig( | |
| importance_sampling_level="sequence_token", | |
| ... | |
| ) | |
| ``` | |
| > [!WARNING] | |
| > To leverage GSPO-token, the user will need to provide the per-token advantage \\( \hat{A_{i,t}} \\) for each token \\( t \\) in the sequence \\( i \\) (i.e., make \\( \hat{A_{i,t}} \\) varies with \\( t \\)—which isn't the case here, \\( \hat{A_{i,t}}=\hat{A_{i}} \\)). Otherwise, GSPO-Token gradient is just equivalent to the original GSPO implementation. | |
| ### GRPO With Replay Buffer | |
| This experimental trainer, trains a model with GRPO but replaces groups (and corresponding completions) that have 0 standard deviation with groups with high rewards and standard deviation that've been used to train a model in prior batches. | |
| #### Usage | |
| ```python | |
| from trl.experimental.grpo_with_replay_buffer import GRPOWithReplayBufferTrainer | |
| from datasets import load_dataset | |
| dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train") | |
| # Guarantee that some rewards have 0 std | |
| def custom_reward_func(completions, **kwargs): | |
| if torch.rand(1).item() < 0.25: | |
| return [0] * len(completions) # simulate some None rewards | |
| else: | |
| return torch.rand(len(completions)).tolist() | |
| training_args = GRPOWithReplayBufferConfig( | |
| output_dir=self.tmp_dir, | |
| learning_rate=1e-4, | |
| per_device_train_batch_size=4, | |
| num_generations=4, | |
| max_completion_length=8, | |
| replay_buffer_size=8, | |
| report_to="none", | |
| ) | |
| trainer = GRPOTrainer( | |
| model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5", | |
| reward_funcs=[custom_reward_func], | |
| args=training_args, | |
| train_dataset=dataset, | |
| ) | |
| previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()} | |
| trainer.train() | |
| ``` | |
| To silence the runtime notice: | |
| ```bash | |
| export TRL_EXPERIMENTAL_SILENCE=1 | |
| ``` | |
| ## Promotion Path (Simple) | |
| 1. **Prototype outside the main repo:** Start development in your own fork or a separate repository to iterate quickly. | |
| 2. **Experimental inclusion:** Once it’s ready for early users, move the idea into `trl.experimental.<feature>`. | |
| 3. **Improve:** Add tests, a short doc/example, and demonstrate the usage. | |
| 4. **Promote:** Once the API proves stable and there is clear interest or adoption from the community, move it into `trl.<feature>` (stable module). | |
| ## FAQ | |
| **Why not just use branches?** | |
| Because branches are not shipped to users; experimental code inside the package lets early adopters try things and give feedback. | |
| **Can these APIs change or vanish without warning?** | |
| Yes. Anything inside `trl.experimental` can change or disappear in *any* release. | |
| **Should I use this in production?** | |
| Only if you are fine with updating your code quickly when things change. | |
| **Will maintainers promptly fix issues in `trl.experimental`?** | |
| Not necessarily. The experimental module is a playground for new ideas, and maintainers may not prioritize bug fixes or feature requests there. Issues may remain unresolved until (or unless) the feature graduates to the stable API. | |
| <EditOnGithub source="https://github.com/huggingface/trl/blob/main/docs/source/experimental.md" /> |
Xet Storage Details
- Size:
- 6.53 kB
- Xet hash:
- 7e69ae869a0f9c4a138ac0af6c365e6eb85927c24e3d06ab0d24ee6cd176314d
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.