---

# LLaMA3-iterative-DPO-final

* **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
* **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
* **Code**: https://github.com/RLHFlow/Online-RLHF

## Introduction

We release an unofficial checkpoint of a state-of-the-art instruct model of its class, **LLaMA3-iterative-DPO-final**.
On all three widely used instruct-model benchmarks (**Alpaca-Eval-V2**, **MT-Bench**, and **Chat-Arena-Hard**), our model outperforms all models of similar size (e.g., LLaMA-3-8B-it), most large open-sourced models (e.g., Mixtral-8x7B-it),
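Since this is a Llama-3-based instruct checkpoint, prompts are expected to follow the Llama-3 chat format. Below is a minimal sketch of building a single-turn prompt by hand, assuming the standard Llama-3 special tokens; in practice one would typically load the checkpoint with `transformers` and use the tokenizer's built-in chat template instead (that step is not shown here).

```python
# Build a prompt in the Llama-3 instruct chat format (special tokens per the
# Llama-3 tokenizer convention). This is an illustrative sketch, not code from
# the RLHFlow release; loading the actual checkpoint is assumed and omitted.

def build_llama3_prompt(system: str, user: str) -> str:
    """Format a single-turn conversation for a Llama-3 instruct model."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "You are a helpful assistant.",
    "Summarize iterative DPO in one sentence.",
)
print(prompt)
```

The trailing `assistant` header with no content cues the model to generate the assistant turn; a tokenizer's `apply_chat_template(..., add_generation_prompt=True)` produces the same shape.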