RLHFlow/pair-preference-model-LLaMA3-8B

Haoxiang-Wang committed on May 17, 2024

Commit a23d9a2 · verified · 1 Parent(s): e1d2459

Update README.md

Files changed (1)

README.md +23 -0

@@ -5,6 +5,7 @@ This preference model is trained from [LLaMA3-8B-it](meta-llama/Meta-Llama-3-8B-
 The dataset is RLHFlow/pair_preference_model_dataset. It achieves Chat 98.6, Chat-Hard 65.8, Safety 89.6, and Reasoning 94.9 on RewardBench.
+See our paper [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/abs/2405.07863) for more details on this model.
 ## Service the RM
@@ -62,4 +63,26 @@ for chosen_position in [0, 1]:
 avg_prob_chosen = np.mean(probs_chosen)
 correct = 0.5 if avg_prob_chosen == 0.5 else float(avg_prob_chosen > 0.5)
 print(correct)
+```
+## Citation
+If you use this model in your research, please consider citing our paper
+```
+@misc{rlhflow,
+      title={RLHF Workflow: From Reward Modeling to Online RLHF},
+      author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
+      year={2024},
+      eprint={2405.07863},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
+```
+and Google's SLiC-HF paper (which originally proposed this pairwise preference model)
+```
+@article{zhao2023slic,
+  title={Slic-hf: Sequence likelihood calibration with human feedback},
+  author={Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J},
+  journal={arXiv preprint arXiv:2305.10425},
+  year={2023}
+}
 ```
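
For readers of this diff, the context lines in the second hunk (`for chosen_position in [0, 1]:`, `avg_prob_chosen`, `correct`) suggest that each pair is scored twice, once with the chosen response in each position, and that the two probabilities are averaged so any position bias cancels out. The loop body is not visible in this diff, so the following is only a minimal sketch of that idea under stated assumptions: `get_logits` is a hypothetical stand-in for the model call, and taking a two-way softmax over "A"/"B" logits is an assumption rather than something confirmed by the lines shown here.

```python
import math


def prob_first_preferred(logit_a: float, logit_b: float) -> float:
    """Two-way softmax over the 'A' and 'B' logits; returns P(first response preferred)."""
    m = max(logit_a, logit_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logit_a - m), math.exp(logit_b - m)
    return ea / (ea + eb)


def score_pair(get_logits, chosen: str, rejected: str):
    """Average P(chosen preferred) over both presentation orders.

    `get_logits(first, second) -> (logit_a, logit_b)` is a hypothetical
    callable standing in for the preference model; it is not part of the README.
    """
    probs_chosen = []
    for chosen_position in [0, 1]:
        # Put the chosen response first (position 0) or second (position 1).
        if chosen_position == 0:
            first, second = chosen, rejected
        else:
            first, second = rejected, chosen
        p_first = prob_first_preferred(*get_logits(first, second))
        # Convert P(first preferred) into P(chosen preferred).
        probs_chosen.append(p_first if chosen_position == 0 else 1.0 - p_first)

    avg_prob_chosen = sum(probs_chosen) / len(probs_chosen)
    # Same tie-handling rule as the context lines in the diff.
    correct = 0.5 if avg_prob_chosen == 0.5 else float(avg_prob_chosen > 0.5)
    return avg_prob_chosen, correct


def toy_length_model(first: str, second: str):
    """Toy stand-in (assumption): prefers the longer response, regardless of order."""
    return float(len(first)), float(len(second))


def toy_position_biased_model(first: str, second: str):
    """Toy stand-in (assumption): always leans toward whichever response comes first."""
    return 1.0, 0.0
```

With the order-independent toy model, `score_pair(toy_length_model, "a long answer", "short")` marks the pair as correct. With the position-biased toy model, the two per-order probabilities sum to roughly 1, so their average sits at about 0.5 and the bias contributes essentially no net preference.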