Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


pair-preference-model-LLaMA3-8B - bnb 4bits
- Model creator: https://huggingface.co/RLHFlow/
- Original model: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B/


Original model description:
---
license: llama3
---

* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (published in TMLR, 2024)
* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
* **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/

This preference model is trained from [LLaMA3-8B-it](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).

The training dataset is [RLHFlow/pair_preference_model_dataset](https://huggingface.co/datasets/RLHFlow/pair_preference_model_dataset). On RewardBench, the model achieves Chat 98.6, Chat-Hard 65.8, Safety 89.6, and Reasoning 94.9.

## Serve the RM

Here is an example of using the preference model to rank a pair of responses. For n > 2 responses, it is recommended to use a tournament-style ranking strategy to find the best response, so that the number of pairwise comparisons stays linear in n.

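The scoring step in the example below reduces to a two-way softmax over the logits of the single tokens "A" and "B". A toy calculation with made-up logits (purely illustrative, not model outputs):

```python
import numpy as np

def pair_prob(logit_A, logit_B, temperature=1.0):
    """Probability that response A is preferred, given the two letter logits."""
    Z = np.exp(logit_A / temperature) + np.exp(logit_B / temperature)
    return np.exp(logit_A / temperature) / Z

# Made-up logits: the model favors "A" by a margin of 2.
p = pair_prob(2.0, 0.0)
print(round(p, 4))  # sigmoid(2.0) ~ 0.8808
```

With two logits this is just a sigmoid of their difference, which is why swapping the response order (as the full example does) and averaging is a cheap way to cancel position bias.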
```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 0

model = AutoModelForCausalLM.from_pretrained(
    script_args.preference_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()
tokenizer = AutoTokenizer.from_pretrained(script_args.preference_name_or_path, use_fast=True)
tokenizer_plain = AutoTokenizer.from_pretrained(script_args.preference_name_or_path, use_fast=True)
# A plain template that renders the conversation as alternating <turn> blocks.
tokenizer_plain.chat_template = "\n{% for message in messages %}{% if loop.index0 % 2 == 0 %}\n\n<turn> user\n {{ message['content'] }}{% else %}\n\n<turn> assistant\n {{ message['content'] }}{% endif %}{% endfor %}\n\n\n"

prompt_template = "[CONTEXT] {context} [RESPONSE A] {response_A} [RESPONSE B] {response_B} \n"
token_id_A = tokenizer.encode("A", add_special_tokens=False)
token_id_B = tokenizer.encode("B", add_special_tokens=False)
assert len(token_id_A) == 1 and len(token_id_B) == 1
token_id_A = token_id_A[0]
token_id_B = token_id_B[0]
temperature = 1.0

model.eval()
response_chosen = "BBBB"
response_rejected = "CCCC"

# Multi-turn conversations are also supported.
instruction = [
    {"role": "user", "content": ...},
    {"role": "assistant", "content": ...},
    {"role": "user", "content": ...},
]
context = tokenizer_plain.apply_chat_template(instruction, tokenize=False)
responses = [response_chosen, response_rejected]
probs_chosen = []

for chosen_position in [0, 1]:
    # Swap the order of the two responses to mitigate position bias.
    response_A = responses[chosen_position]
    response_B = responses[1 - chosen_position]
    prompt = prompt_template.format(context=context, response_A=response_A, response_B=response_B)
    message = [
        {"role": "user", "content": prompt},
    ]

    input_ids = tokenizer.encode(
        tokenizer.apply_chat_template(message, tokenize=False).replace(tokenizer.bos_token, ""),
        return_tensors="pt",
        add_special_tokens=False,
    ).cuda()

    with torch.no_grad():
        output = model(input_ids)
    logit_A = output.logits[0, -1, token_id_A].item()
    logit_B = output.logits[0, -1, token_id_B].item()
    # Softmax over the two letter logits gives the preference probability.
    Z = np.exp(logit_A / temperature) + np.exp(logit_B / temperature)
    logit_chosen = [logit_A, logit_B][chosen_position]
    prob_chosen = np.exp(logit_chosen / temperature) / Z
    probs_chosen.append(prob_chosen)

avg_prob_chosen = np.mean(probs_chosen)
correct = 0.5 if avg_prob_chosen == 0.5 else float(avg_prob_chosen > 0.5)
print(correct)
```
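The tournament-style strategy mentioned above can be sketched as a single-elimination pass: the current winner plays each remaining response once, so n responses cost only n - 1 model calls. `judge` below stands in for the averaged pair probability computed by the script; the stub used here (preferring the longer string) is purely hypothetical, for illustration:

```python
def tournament_best(responses, judge):
    """Return the winner of a single-elimination pass over `responses`.

    `judge(a, b)` should return the probability that `a` is preferred
    over `b` (e.g. the position-bias-averaged probability above).
    Makes exactly len(responses) - 1 pairwise calls.
    """
    best = responses[0]
    for challenger in responses[1:]:
        if judge(challenger, best) > 0.5:
            best = challenger
    return best

# Hypothetical stand-in judge: prefer the longer response.
toy_judge = lambda a, b: 1.0 if len(a) > len(b) else 0.0

print(tournament_best(["ok", "a longer answer", "mid one"], toy_judge))
# -> a longer answer
```

Note that the result can depend on the order of `responses` when the judge is not transitive, which is a known caveat of tournament ranking with learned preference models.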

## Citation
If you use this model in your research, please consider citing our paper:
```
@misc{rlhflow,
    title={RLHF Workflow: From Reward Modeling to Online RLHF},
    author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
    year={2024},
    eprint={2405.07863},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
and Google's SLiC-HF paper (which originally proposed this pairwise preference model):
```
@article{zhao2023slic,
    title={SLiC-HF: Sequence Likelihood Calibration with Human Feedback},
    author={Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J},
    journal={arXiv preprint arXiv:2305.10425},
    year={2023}
}
```