---
license: mit
datasets:
- Anthropic/hh-rlhf
metrics:
- accuracy
---

A GPT2-large reward model trained on the **Anthropic/hh-rlhf harmless dataset**. It is intended for harmful-response detection and for use as a reward model in RLHF. It achieves an accuracy of **0.73698** on the test set, nearly matching reward models of larger sizes.

Note:
1. Use the dialogue format of the Anthropic/hh-rlhf dataset (`\n\nHuman: ... \n\nAssistant:`) when preparing inputs for inference (see the sketch below).
2. This reward model differs from other open-source reward models that are trained on the full Anthropic/hh-rlhf dataset; it is trained on the harmless subset only.
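
For point 1, a raw question is wrapped in the dataset's `\n\nHuman: ... \n\nAssistant:` template before tokenization. A minimal sketch of that wrapping (the helper name `to_hh_format` is illustrative, not part of any library):

```python
def to_hh_format(question: str) -> str:
    # Wrap a raw user question in the Anthropic/hh-rlhf dialogue template,
    # leaving the Assistant turn open so a candidate answer can be scored against it.
    return f"\n\nHuman: {question} \n\nAssistant:"
```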
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the reward model (a GPT2-large sequence classifier with a single scalar head).
rm_tokenizer = AutoTokenizer.from_pretrained('Ray2333/gpt2-large-harmless-reward_model')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/gpt2-large-harmless-reward_model',
    num_labels=1, torch_dtype=torch.bfloat16,
    device_map=0,
)

# Question in the Anthropic/hh-rlhf dialogue format, plus a candidate answer to score.
q, a = "\n\nHuman: I just came out from jail, any suggestion of my future? \n\nAssistant:", "Go back to jail you scum"
inputs = rm_tokenizer(q, a, return_tensors='pt', truncation=True)
with torch.no_grad():
    # The scalar logit is the harmlessness reward; higher means less harmful.
    reward = reward_model(**(inputs.to(0))).logits[0].cpu().detach().item()
```
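
A typical use is to score several candidate answers to the same question and keep the highest-scoring (least harmful) one. A minimal sketch reusing `rm_tokenizer` and `reward_model` from above (the candidate strings and the helper name `harmless_score` are illustrative):

```python
def harmless_score(question: str, answer: str) -> float:
    # Score a (question, answer) pair; a higher scalar logit means the answer
    # is judged less harmful by the reward model.
    inputs = rm_tokenizer(question, answer, return_tensors='pt', truncation=True)
    with torch.no_grad():
        return reward_model(**inputs.to(0)).logits[0].item()

q = "\n\nHuman: I just came out from jail, any suggestion of my future? \n\nAssistant:"
candidates = [
    "Go back to jail you scum",
    "Congratulations on your release. Looking into job training and community support programs could be a good start.",
]
# Keep the candidate with the highest harmlessness reward.
best = max(candidates, key=lambda a: harmless_score(q, a))
```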

## References

This reward model was used for multi-objective alignment (in particular, the "harmless" and "helpful" objectives) in the Rewards-in-Context project (ICML 2024).
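
As a simple way to trade off the two objectives, the harmless reward can be combined with a helpful reward model's score through a weighted sum. This is a generic linear-scalarization sketch, not the Rewards-in-Context method itself; the helpful model ID below is an assumption, and both models are assumed to share the same tokenizer:

```python
# Generic weighted combination of two reward models (illustrative only).
helpful_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/gpt2-large-helpful-reward_model',  # assumed companion helpful reward model
    num_labels=1, torch_dtype=torch.bfloat16, device_map=0,
)

def combined_score(question: str, answer: str, w_harmless: float = 0.5) -> float:
    # Weighted sum of harmless and helpful rewards; w_harmless trades off the two objectives.
    inputs = rm_tokenizer(question, answer, return_tensors='pt', truncation=True).to(0)
    with torch.no_grad():
        harmless = reward_model(**inputs).logits[0].item()
        helpful = helpful_model(**inputs).logits[0].item()
    return w_harmless * harmless + (1.0 - w_harmless) * helpful
```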

```bibtex
@article{yang2024rewards,
  title={Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment},
  author={Yang, Rui and Pan, Xiaoman and Luo, Feng and Qiu, Shuang and Zhong, Han and Yu, Dong and Chen, Jianshu},
  journal={International Conference on Machine Learning},
  year={2024}
}
```