Model Card for helpfulness-preference-model-qwen-0.6B-merged
This model is a fine-tuned version of Qwen/Qwen3-0.6B, trained with TRL on the helpful-only subset of the Anthropic/hh-rlhf dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf). It is a preference (reward) model, fit on pairs of chosen and rejected responses.
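Reward models of this kind are typically fit with a pairwise Bradley-Terry objective: the model assigns a scalar score to each response, and the loss pushes the chosen response's score above the rejected one's. A minimal sketch of that objective (plain Python, illustrative only; `pairwise_loss` and the example scores are hypothetical, not part of this repository):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Near zero when the chosen response scores well above the rejected one,
    log(2) when the two scores tie, and large when the ranking is inverted.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Tied scores give the maximum-uncertainty loss log(2).
print(round(pairwise_loss(1.0, 1.0), 3))   # 0.693
# A clear margin in the right direction drives the loss toward 0.
print(round(pairwise_loss(3.0, -1.0), 3))  # 0.018
```

Minimizing this loss over the dataset's chosen/rejected pairs is what turns a sequence-classification head with a single output into a helpfulness score.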
Quick start
To learn how to train a PPO-based RLHF model with a reward model like this one, see the TRL PPOTrainer documentation: https://huggingface.co/docs/trl/v0.8.2/ppo_trainer
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig

# bnb_config was referenced but not defined in the original snippet; a 4-bit
# quantization configuration such as this is one reasonable choice.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForSequenceClassification.from_pretrained(
    "Realmbird/helpfulness-preference-model-qwen-0.6B-merged",
    num_labels=1,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Realmbird/helpfulness-preference-model-qwen-0.6B-merged")
Training procedure
This model was trained with TRL's Reward method (RewardTrainer).
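The exact training script is not included in this card; the outline below is a sketch of how such a run is typically set up with TRL's RewardTrainer. The hyperparameters and `output_dir` are illustrative assumptions, not the values used for this model.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

# Base model with a single-output classification head that produces the reward score.
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen3-0.6B", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# hh-rlhf provides "chosen"/"rejected" text columns, the implicit preference
# format that RewardTrainer consumes.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward-model", per_device_train_batch_size=2),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```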
Framework versions
- TRL: 0.20.0
- Transformers: 4.54.1
- Pytorch: 2.6.0+cu124
- Datasets: 4.0.0
- Tokenizers: 0.21.2
Citations
Cite TRL as:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}