li11111 committed on
Commit
8f239cc
verified
1 Parent(s): 885f26a

Upload README.md

  1. README.md +110 -6
README.md CHANGED
@@ -1,7 +1,111 @@
  ---
- license: apache-2.0
- datasets:
- - princeton-nlp/llama3-ultrafeedback
- base_model:
- - meta-llama/Meta-Llama-3-8B-Instruct
- ---
+ language:
+ - en
+ pipeline_tag: text-generation
+ tags:
+ - pytorch
+ - llama-3
+ ---
+
+ ## Model Details
+
+ We use **Llama3-Instruct (8B)** as one of the base models for evaluating our proposed **Reward-Driven Selective Penalization for Preference Alignment Optimization (RSPO)** method. This model was trained for **one epoch** on the **Llama3-UltraFeedback** dataset using **RSPO**.
+
+ ## How to use
+
+ #### Transformers AutoModelForCausalLM
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "li11111/Llama3-Instruct-8B-RSPO"
+
+ # Load the tokenizer and the model in bfloat16, placing weights automatically.
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ # Format the conversation with the model's chat template.
+ input_ids = tokenizer.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Stop at either the regular EOS token or the end-of-turn token.
+ terminators = [
+     tokenizer.eos_token_id,
+     tokenizer.convert_tokens_to_ids("<|eot_id|>"),
+ ]
+
+ outputs = model.generate(
+     input_ids,
+     max_new_tokens=256,
+     eos_token_id=terminators,
+     do_sample=True,
+     temperature=0.6,
+     top_p=0.9,
+ )
+ # Decode only the newly generated tokens, skipping the prompt.
+ response = outputs[0][input_ids.shape[-1]:]
+ print(tokenizer.decode(response, skip_special_tokens=True))
+ ```
+
+ ## Experiment Parameters
+
+ | **Parameter**       | **Llama-3-Instruct** |
+ | ------------------- | -------------------- |
+ | `GPU`               | 8×Ascend910B         |
+ | `beta`              | 0.01                 |
+ | `batch`             | 128                  |
+ | `learning_rate`     | 7e-7                 |
+ | `max_prompt_length` | 512                  |
+ | `max_length`        | 1024                 |
+ | `num_train_epochs`  | 1                    |
+ | `torch_dtype`       | `bfloat16`           |
+ | `warmup_ratio`      | 0.1                  |
+ | `β_w`               | 0.01                 |
+ | `β_l`               | 0.1                  |
+ | `λ`                 | 0.1                  |
+
+
+ ## Training Data
+
+ We train on the [princeton-nlp/llama3-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback) dataset, created by the [princeton-nlp team](https://huggingface.co/princeton-nlp) for training Llama3-Instruct models. UltraFeedback supplies only the prompts; the chosen and rejected responses (y_w, y_l) are regenerated with the SFT model. For each prompt x, five responses are sampled from the SFT model at a temperature of 0.8. The responses are then scored with [llm-blender/PairRM](https://huggingface.co/llm-blender/PairRM), and the highest-scoring response is taken as y_w and the lowest-scoring one as y_l.
+
+
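The pair-selection step described above can be sketched as follows. This is a minimal illustration, not the dataset authors' pipeline: `select_preference_pair`, `score_fn`, and `toy_score` are hypothetical names, and the scorer is a toy stand-in. PairRM itself is a pairwise ranker, so treating it as a function that returns one scalar per response is a simplifying assumption made here.

```python
from typing import Callable, Sequence, Tuple


def select_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    scores = [score_fn(prompt, c) for c in candidates]
    indices = range(len(candidates))
    best = max(indices, key=scores.__getitem__)   # index of y_w
    worst = min(indices, key=scores.__getitem__)  # index of y_l
    return candidates[best], candidates[worst]


def toy_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a reward model: longer responses score higher.
    return float(len(response))


# Five sampled responses for one prompt (made-up strings for illustration).
candidates = [
    "Hi.",
    "Hello there!",
    "Hello, how can I help you today?",
    "Hey",
    "Good day to you.",
]
chosen, rejected = select_preference_pair("Greet the user.", candidates, toy_score)
print("y_w:", chosen)
print("y_l:", rejected)
```

In a real pipeline `score_fn` would wrap the reward model, and the five candidates would come from sampling the SFT model at temperature 0.8 as described above.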
+ ## Benchmarks
+
+ <table>
+   <tr>
+     <th>Method</th>
+     <th colspan="3" style="text-align: center;">AlpacaEval 2.0</th>
+   </tr>
+   <tr>
+     <th></th>
+     <th>LC (%)</th>
+     <th>WR (%)</th>
+     <th>Avg. Len</th>
+   </tr>
+   <tr>
+     <td><b>RSPO</b></td>
+     <td><b>45.0</b></td>
+     <td><b>42.5</b></td>
+     <td>1870</td>
+   </tr>
+ </table>