To overcome these limitations, we propose **SuperCorrect**, a novel two-stage framework that uses a large teacher model to *supervise* and *correct* both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model by following the teacher's correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts with error-driven insights from the teacher model, enabling it to break through the bottleneck of its own reasoning and acquire new skills and knowledge to tackle challenging problems.
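The cross-model objective builds on the standard DPO loss over preference pairs. A minimal sketch of that loss on a single pair is shown below; the pairing of a teacher-corrected trace as "chosen" and the student's erroneous trace as "rejected", along with all function and variable names, is illustrative and not the paper's exact formulation:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs.

    In a cross-model setup (illustrative assumption): the 'chosen' sequence
    would be the teacher-corrected reasoning trace and the 'rejected' sequence
    the student's original erroneous trace.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios,
    # scaled by the inverse temperature beta.
    logits = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(logits)), written with log1p for numerical stability.
    return math.log1p(math.exp(-logits))

# When the policy prefers the corrected trace more than the reference does,
# the loss drops below log(2) (the value at a zero margin).
loss = dpo_loss(policy_chosen=-1.0, policy_rejected=-2.0,
                ref_chosen=-1.5, ref_rejected=-1.5)
```

Minimizing this loss pushes the student policy to assign relatively higher likelihood to the corrected trace than to its own erroneous one, relative to the frozen reference model.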
Extensive experiments consistently demonstrate the superiority of our method over previous approaches. Notably, our **SuperCorrect-7B** model significantly **surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3%** on MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models.
</details>
## Introduction
