# Reward Modeling

TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.

Check out a complete, flexible example at [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py).

## Expected dataset format

The [`RewardTrainer`] expects a very specific format for the dataset, since the model will be trained on pairs of examples to predict which of the two is preferred. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below:

<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/rlhf-antropic-example.png" width="50%">
</div>

Therefore, the final dataset object should contain at least four entries if you use the default [`RewardDataCollatorWithPadding`] data collator. The entries should be named:

- `input_ids_chosen`
- `attention_mask_chosen`
- `input_ids_rejected`
- `attention_mask_rejected`
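
For example, assuming a pairwise dataset with plain-text `chosen` and `rejected` columns (as in `Anthropic/hh-rlhf`), the preprocessing could look like the following sketch; the tokenizer choice and truncation settings here are illustrative assumptions, not prescriptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def preprocess(example):
    # Tokenize the preferred and rejected texts separately so the default
    # collator can pad each side independently.
    chosen = tokenizer(example["chosen"], truncation=True)
    rejected = tokenizer(example["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = load_dataset("Anthropic/hh-rlhf", split="train")
dataset = dataset.map(preprocess)
```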
## Using the `RewardTrainer`

After preparing your dataset, you can use the [`RewardTrainer`] in the same way as the `Trainer` class from 🤗 Transformers. You should pass an `AutoModelForSequenceClassification` model to the [`RewardTrainer`], along with a [`RewardConfig`] that configures the training hyperparameters.
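
For instance, a minimal end-to-end sketch could look like the following; the `RewardConfig` values shown are illustrative assumptions, not recommended hyperparameters:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# num_labels=1 so the model outputs a single scalar reward per sequence
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

training_args = RewardConfig(
    output_dir="reward_model",       # illustrative output path
    per_device_train_batch_size=16,  # illustrative value
    num_train_epochs=1,              # illustrative value
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,  # the preprocessed pairwise dataset from above
)
trainer.train()
```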
### Leveraging 🤗 PEFT to train a reward model

Just pass a `peft_config` in the keyword arguments of [`RewardTrainer`], and the trainer should automatically take care of converting the model into a PEFT model!
```python
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")

# LoRA adapters for a sequence-classification head (the reward model)
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

...

trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()
```
### Adding a margin to the loss

As in the [Llama 2 paper](https://huggingface.co/papers/2307.09288), you can add a margin to the loss by adding a `margin` column to the dataset. The reward collator will automatically pass it through, and the loss becomes `-log(sigmoid(r_chosen - r_rejected - margin))`, so pairs whose scores are further apart must be separated by a larger reward gap.
```python
def add_margin(row):
    # Assumes the dataset provides scalar scores for both completions.
    return {"margin": row["score_chosen"] - row["score_rejected"]}

dataset = dataset.map(add_margin)
```
## RewardConfig

[[autodoc]] RewardConfig

## RewardTrainer

[[autodoc]] RewardTrainer