Update README.md
README.md CHANGED

@@ -12,7 +12,7 @@ licence: license
 <a href="https://pnubaelab.github.io/"><b>BAELAB</b></a>, Pusan National University, Busan, Korea
 </p>
 <p align="center">
-
+Yongjae Lee<sup>*</sup>, Taekyhun Park<sup>*</sup>, Hyerim Bae<sup>†</sup>
 </p>


@@ -32,9 +32,9 @@ licence: license
 Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)
 fine-tuning for post-training language models. However, for reasoning tasks such as mathematical problem solving,
 binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted
-to address this issue by estimating
+to address this issue by estimating **overall** reasoning quality, it remains unclear whether these rewards are
 reliable proxies for the quality of stepwise reasoning. In this study, we consider reasoning as a structured process and
-propose
+propose the **TACReward** reward model. The model can be seamlessly integrated into sparse reward frameworks without
 additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations
 between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range $[0, 1]$.
 Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward