xinlai/DeepSeekMath-RL-Step-DPO

xinlai committed on Jun 28, 2024

Commit 0134ded · verified · 1 Parent(s): 6d4f1cf

Update README.md

Files changed (1)

README.md +13 -3

@@ -1,3 +1,13 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+# Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
+🖥️[Code](https://github.com/dvlab-research/Step-DPO) | 🤗[Data](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K) | 📄[Paper](https://arxiv.org/pdf/2406.18629)
+This repo contains the **DeepSeekMath-RL-Step-DPO** model for our paper **Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs**. **Step-DPO** is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, when applied to Qwen2-72B-Instruct, Step-DPO achieves scores of **70.8%** and **94.0%** on the test sets of **MATH** and **GSM8K**, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
+## Contact
+Please submit an issue [here](https://github.com/dvlab-research/Step-DPO) or send me an email [here](mailto:xinlai@cse.cuhk.edu.hk).