---
license: mit
datasets:
- Anthropic/hh-rlhf
metrics:
- accuracy
---

A GPT2-large reward model trained on the **Anthropic/hh-rlhf harmless dataset**. It is intended for harmful-response detection and for use as a reward model in RLHF. It achieves an accuracy of **0.73698** on the test set, nearly matching reward models of larger sizes.

Note:
1. Use the dialogue format of the Anthropic/hh-rlhf dataset (`\n\nHuman: ... \n\nAssistant:`) when preparing inputs for inference (see the sketch below).
2. This reward model differs from other open-source reward models that are trained on the full Anthropic/hh-rlhf dataset; it is trained on the harmless subset only.
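
For point 1, a raw question is wrapped in the dataset's `\n\nHuman: ... \n\nAssistant:` template before tokenization. A minimal sketch of that wrapping (the helper name `to_hh_format` is illustrative, not part of any library):

```python
def to_hh_format(question: str) -> str:
    # Wrap a raw user question in the Anthropic/hh-rlhf dialogue template,
    # leaving the Assistant turn open so a candidate answer can be scored against it.
    return f"\n\nHuman: {question} \n\nAssistant:"
```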
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the reward model (a GPT2-large sequence classifier with a single scalar head).
rm_tokenizer = AutoTokenizer.from_pretrained('Ray2333/gpt2-large-harmless-reward_model')
reward_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/gpt2-large-harmless-reward_model',
    num_labels=1, torch_dtype=torch.bfloat16,
    device_map=0,
)

# Question in the Anthropic/hh-rlhf dialogue format, plus a candidate answer to score.
q, a = "\n\nHuman: I just came out from jail, any suggestion of my future? \n\nAssistant:", "Go back to jail you scum"
inputs = rm_tokenizer(q, a, return_tensors='pt', truncation=True)
with torch.no_grad():
    # The scalar logit is the harmlessness reward; higher means less harmful.
    reward = reward_model(**(inputs.to(0))).logits[0].cpu().detach().item()
```
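
A typical use is to score several candidate answers to the same question and keep the highest-scoring (least harmful) one. A minimal sketch reusing `rm_tokenizer` and `reward_model` from above (the candidate strings and the helper name `harmless_score` are illustrative):

```python
def harmless_score(question: str, answer: str) -> float:
    # Score a (question, answer) pair; a higher scalar logit means the answer
    # is judged less harmful by the reward model.
    inputs = rm_tokenizer(question, answer, return_tensors='pt', truncation=True)
    with torch.no_grad():
        return reward_model(**inputs.to(0)).logits[0].item()

q = "\n\nHuman: I just came out from jail, any suggestion of my future? \n\nAssistant:"
candidates = [
    "Go back to jail you scum",
    "Congratulations on your release. Looking into job training and community support programs could be a good start.",
]
# Keep the candidate with the highest harmlessness reward.
best = max(candidates, key=lambda a: harmless_score(q, a))
```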

## References

This reward model was used for multi-objective alignment (in particular, the "harmless" and "helpful" objectives) in the Rewards-in-Context project (ICML 2024).
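
As a simple way to trade off the two objectives, the harmless reward can be combined with a helpful reward model's score through a weighted sum. This is a generic linear-scalarization sketch, not the Rewards-in-Context method itself; the helpful model ID below is an assumption, and both models are assumed to share the same tokenizer:

```python
# Generic weighted combination of two reward models (illustrative only).
helpful_model = AutoModelForSequenceClassification.from_pretrained(
    'Ray2333/gpt2-large-helpful-reward_model',  # assumed companion helpful reward model
    num_labels=1, torch_dtype=torch.bfloat16, device_map=0,
)

def combined_score(question: str, answer: str, w_harmless: float = 0.5) -> float:
    # Weighted sum of harmless and helpful rewards; w_harmless trades off the two objectives.
    inputs = rm_tokenizer(question, answer, return_tensors='pt', truncation=True).to(0)
    with torch.no_grad():
        harmless = reward_model(**inputs).logits[0].item()
        helpful = helpful_model(**inputs).logits[0].item()
    return w_harmless * harmless + (1.0 - w_harmless) * helpful
```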

```bibtex
@article{yang2024rewards,
  title={Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment},
  author={Yang, Rui and Pan, Xiaoman and Luo, Feng and Qiu, Shuang and Zhong, Han and Yu, Dong and Chen, Jianshu},
  journal={International Conference on Machine Learning},
  year={2024}
}
```