li11111/Llama3-Instruct-8B-RSPO

 ---
+language:
+- en
+pipeline_tag: text-generation
+tags:
+- pytorch
+- llama-3
+---
+## Model Details
+We employ **Llama3-Instruct (8B)** as one of the base models to evaluate our proposed **Reward-Driven Selective Penalization for Preference Alignment Optimization (RSPO)** method. The model is trained for **one epoch** on the **Llama3-UltraFeedback dataset** using the **RSPO** method.
+## How to use
+#### Transformers AutoModelForCausalLM
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "li11111/Llama3-Instruct-8B-RSPO"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+messages = [
+    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+    {"role": "user", "content": "Who are you?"},
+]
+input_ids = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    return_tensors="pt"
+).to(model.device)
+# Stop on either the tokenizer's default EOS token or Llama 3's end-of-turn token.
+terminators = [
+    tokenizer.eos_token_id,
+    tokenizer.convert_tokens_to_ids("<|eot_id|>")
+]
+outputs = model.generate(
+    input_ids,
+    max_new_tokens=256,
+    eos_token_id=terminators,
+    do_sample=True,
+    temperature=0.6,
+    top_p=0.9,
+)
+# Slice off the prompt tokens so that only the newly generated text is decoded.
+response = outputs[0][input_ids.shape[-1]:]
+print(tokenizer.decode(response, skip_special_tokens=True))
+```
+## Experiment Parameters
+| **Parameter**       | **Llama-3-Instruct** |
+| ------------------- | -------------------- |
+| `GPU`               | 8×Ascend910B         |
+| `beta`              | 0.01                 |
+| `batch`             | 128                  |
+| `learning_rate`     | 7e-7                 |
+| `max_prompt_length` | 512                  |
+| `max_length`        | 1024                 |
+| `num_train_epochs`  | 1                    |
+| `torch_dtype`       | `bfloat16`           |
+| `warmup_ratio`      | 0.1                  |
+| `β_w`               | 0.01                 |
+| `β_l`               | 0.1                  |
+| `λ`                 | 0.1                  |
+## Training Data
+We use the [princeton-nlp/llama3-ultrafeedback](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback) dataset, created by the [princeton-nlp team](https://huggingface.co/princeton-nlp), to train the Llama3-Instruct models. UltraFeedback provides the prompts, and the chosen and rejected response pairs (yw, yl) are regenerated using the SFT models. For each prompt x, five responses are generated with the SFT model at a sampling temperature of 0.8. The responses are then scored with [llm-blender/PairRM](https://huggingface.co/llm-blender/PairRM); the highest-scoring response is selected as yw and the lowest-scoring one as yl.
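
The pair-selection step described above can be sketched as follows. This is a minimal illustration, not the dataset authors' actual pipeline: `build_preference_pair` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a real reward model such as PairRM (which ranks candidates pairwise rather than returning a simple length-based score).

```python
from typing import Callable, Dict, List


def build_preference_pair(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, str]:
    """Return the highest-scoring response as chosen (yw) and the lowest as rejected (yl)."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    scores = [score_fn(prompt, r) for r in responses]
    best = max(range(len(responses)), key=scores.__getitem__)
    worst = min(range(len(responses)), key=scores.__getitem__)
    return {"prompt": prompt, "chosen": responses[best], "rejected": responses[worst]}


def toy_score(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a reward model: longer responses score higher.
    return float(len(response))


# Five sampled responses for one prompt, mirroring the five-candidate setup above.
candidates = [
    "A bot.",
    "I am a helpful assistant.",
    "Hi.",
    "An AI model.",
    "I'm an assistant.",
]
pair = build_preference_pair("Who are you?", candidates, toy_score)
print(pair["chosen"])
print(pair["rejected"])
```

In practice `score_fn` would wrap the reward model's scoring call, and the five candidates would come from sampling the SFT model at temperature 0.8.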
+## Benchmarks
+<table>
+    <tr>
+        <th>Method</th>
+        <th colspan="3" style="text-align: center;">AlpacaEval 2.0</th>
+    </tr>
+    <tr>
+        <th></th>
+        <th>LC win rate (%)</th>
+        <th>Win rate (%)</th>
+        <th>Avg. length</th>
+    </tr>
+    <tr>
+        <td><b>RSPO</b></td>
+        <td><b>45.0</b></td>
+        <td><b>42.5</b></td>
+        <td>1870</td>
+    </tr>
+</table>