Fine-tuning datasets for this model are based on [Stack Exchange Paired](https://huggingface.co/datasets/lvwerra/stack-exchange-paired).
**Traditional Fine-tuning:** [https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/finetune](https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/finetune)

**DPO Training:** [https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/rl](https://huggingface.co/datasets/lvwerra/stack-exchange-paired/tree/main/data/rl)
### Training Procedure

The model was first fine-tuned on the Stack Exchange question and answer pairs and then fine-tuned via the DPO training procedure using a Stack Exchange Reward Model.

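Concretely, DPO replaces an explicit RL loop with a classification-style loss over preference pairs. The sketch below is a generic, from-scratch illustration of that objective, not the code used to train this model; the function name and toy log-probabilities are illustrative only:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how much more likely each completion is under the
    # policy being trained than under the frozen reference (SFT) model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): small when the policy prefers the chosen answer
    # more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy per-sequence log-probs: the loss drops when the chosen answer is favored.
good = dpo_loss(-4.0, -6.0, -5.0, -5.5)   # policy prefers the chosen answer
bad = dpo_loss(-6.0, -4.0, -5.0, -5.5)    # policy prefers the rejected answer
```

With a zero margin the loss is exactly `log 2`, and it decreases monotonically as the policy's preference for the chosen completion grows.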
It is trained to respond to prompts with the following template:

```
Question: <Query>

Answer: <Response>
```
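For illustration, a prompt in this template can be assembled with plain string formatting; `build_prompt` below is a hypothetical helper, not part of any released API for this model:

```python
def build_prompt(query: str) -> str:
    # Fill the user query into the model's expected template; the model's
    # completion is expected to follow the trailing "Answer: " marker.
    return f"Question: {query}\n\nAnswer: "

prompt = build_prompt("How do I reverse a list in Python?")
```

The resulting string would then be passed to the tokenizer and generation loop as usual.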