li11111 committed on
Commit
34ff8a6
verified
1 Parent(s): 0146914

Upload README.md

Files changed (1)
  1. README.md +117 -3
README.md CHANGED
@@ -1,3 +1,117 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - pytorch
7
+ - Mistral
8
+ ---
9
+
10
+ ## Model Details
11
+
12
+ We use **Mistral-Base (7B)** as one of the base models for evaluating our proposed **Reward-Driven Selective Penalization for Preference Alignment Optimization (RSPO)** method. The model is trained for **one epoch** on the **UltraFeedback Binarized** dataset using **RSPO**.
13
+
14
+ ## How to use
15
+
16
+ #### Transformers AutoModelForCausalLM
17
+
18
+ ```python
19
+ from transformers import AutoTokenizer, AutoModelForCausalLM
20
+ import torch
21
+
22
+ model_id = "li11111/Mistral-7B-Base-RSPO"
23
+
24
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
25
+ model = AutoModelForCausalLM.from_pretrained(
26
+     model_id,
27
+     torch_dtype=torch.bfloat16,
28
+     device_map="auto",
29
+ )
30
+
31
+ messages = [
32
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
33
+     {"role": "user", "content": "Who are you?"},
34
+ ]
35
+
36
+ input_ids = tokenizer.apply_chat_template(
37
+     messages,
38
+     add_generation_prompt=True,
39
+     return_tensors="pt"
40
+ ).to(model.device)
41
+
42
+ terminators = [
43
+     tokenizer.eos_token_id,
44
+     tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # Llama 3 style token; may be absent from this tokenizer's vocabulary
45
+ ]
46
+
47
+ outputs = model.generate(
48
+     input_ids,
49
+     max_new_tokens=256,
50
+     eos_token_id=terminators,
51
+     do_sample=True,
52
+     temperature=0.6,
53
+     top_p=0.9,
54
+ )
55
+ response = outputs[0][input_ids.shape[-1]:]
56
+ print(tokenizer.decode(response, skip_special_tokens=True))
57
+ ```
58
+
59
+ ## Experiment Parameters
60
+
61
+ | **Parameter** | **Mistral-Base(7B)** |
62
+ | ------------------- | -------------------- |
63
+ | `GPU` | 8×Ascend910B |
64
+ | `beta` | 0.01 |
65
+ | `batch` | 128 |
66
+ | `learning_rate` | 5e-7 |
67
+ | `max_prompt_length` | 512 |
68
+ | `max_length` | 1024 |
69
+ | `num_train_epochs` | 1 |
70
+ | `torch_dtype` | `bfloat16` |
71
+ | `warmup_ratio` | 0.1 |
72
+ | `β_w` | 0.01 |
73
+ | `β_l` | 0.1 |
74
+ | `λ` | 0.1 |
75
+
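For reproduction, the table above can be captured in a small config object. This is a sketch only: the class name `RSPOConfig`, its field names, and the helper `gradient_accumulation_steps` are illustrative and not taken from the training code. It also assumes that `batch` is the global batch size across all 8 devices, which the table does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RSPOConfig:
    """Values copied from the Experiment Parameters table; names are illustrative."""

    num_devices: int = 8
    beta: float = 0.01
    global_batch_size: int = 128  # assumption: `batch` is the global batch size
    learning_rate: float = 5e-7
    max_prompt_length: int = 512
    max_length: int = 1024
    num_train_epochs: int = 1
    torch_dtype: str = "bfloat16"
    warmup_ratio: float = 0.1
    beta_w: float = 0.01  # third-from-last row of the table
    beta_l: float = 0.1   # second-from-last row of the table
    lam: float = 0.1      # last row of the table


def gradient_accumulation_steps(cfg: RSPOConfig, per_device_batch_size: int) -> int:
    """Accumulation steps needed to reach the global batch size."""
    samples_per_step = cfg.num_devices * per_device_batch_size
    if cfg.global_batch_size % samples_per_step != 0:
        raise ValueError("global batch size must be divisible by devices * per-device batch")
    return cfg.global_batch_size // samples_per_step


cfg = RSPOConfig()
# Tokens left for the response once the prompt budget is used up.
max_completion_length = cfg.max_length - cfg.max_prompt_length
# A per-device batch size of 2 is a hypothetical choice, not a reported setting.
accumulation = gradient_accumulation_steps(cfg, per_device_batch_size=2)
```

Under these assumptions, a per-device batch size of 2 on 8 devices needs 8 accumulation steps to reach a global batch of 128, and each example leaves up to 512 tokens for the response.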
76
+
77
+ ## Training Data
78
+
79
+ We use the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset to train the Mistral Base model.
80
+
81
+
82
+ ## Benchmarks
83
+
84
+ <table>
85
+ <tr>
86
+ <th>Method</th>
87
+ <th colspan="3" style="text-align: center;">AlpacaEval 2.0</th>
88
+ </tr>
89
+ <tr>
90
+ <th></th>
91
+ <th>LC</th>
92
+ <th>WR</th>
93
+ <th>Avg. Len</th>
94
+ </tr>
95
+ <tr>
96
+ <td><b>RSPO</b></td>
97
+ <td><b>25.4</b></td>
98
+ <td><b>23.7</b></td>
99
+ <td>1873</td>
100
+ </tr>
101
+ </table>
102
+
103
+
104
+
105
+ | **Method** | **GSM8K** | **ARC** | **TQA** | **MMLU** | **IFEval** | **Avg.** |
106
+ | ---------- | --------- | --------- | --------- | --------- | ---------- | --------- |
107
+ | **SFT** | **42.61** | 55.97 | 28.15 | 57.17 | 36.59 | 44.10 |
108
+ | **DPO** | 33.13 | 59.64 | 46.14 | 57.46 | 50.48 | 49.37 |
109
+ | **R-DPO** | 30.10 | 56.06 | 40.64 | 58.48 | 53.24 | 47.70 |
110
+ | **SimPO** | 33.59 | **60.15** | 43.45 | 58.25 | 52.98 | 49.68 |
111
+ | **WPO** | 30.63 | 57.00 | 40.51 | 58.54 | **55.64** | 48.46 |
112
+ | **RSPO** | 37.45 | 57.94 | **47.25** | **58.58** | 55.04 | **51.25** |
113
+
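The **Avg.** column in the table above appears to be the unweighted mean of the five task scores. The short check below recomputes it from the reported per-task numbers. The variable and function names are illustrative, and no new results are introduced.

```python
# Per-task scores copied from the table: GSM8K, ARC, TQA, MMLU, IFEval.
scores = {
    "SFT": [42.61, 55.97, 28.15, 57.17, 36.59],
    "DPO": [33.13, 59.64, 46.14, 57.46, 50.48],
    "R-DPO": [30.10, 56.06, 40.64, 58.48, 53.24],
    "SimPO": [33.59, 60.15, 43.45, 58.25, 52.98],
    "WPO": [30.63, 57.00, 40.51, 58.54, 55.64],
    "RSPO": [37.45, 57.94, 47.25, 58.58, 55.04],
}


def mean_score(values):
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


averages = {method: mean_score(values) for method, values in scores.items()}
best_method = max(averages, key=averages.get)

for method, avg in averages.items():
    print(f"{method:6s} {avg:.2f}")
```

The recomputed means match the reported column, with RSPO highest on average even though it leads only on TQA and MMLU individually.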
114
+
115
+
116
+
117
+