Update README.md
README.md
CHANGED
@@ -2,11 +2,11 @@
 
 datasets: DeepMath-103k
 library_name: transformers
-model_name:
+model_name: TACReward7B
 licence: license
 ---
 
-<h1 align= "center"> Reasoning-Aware
+<h1 align="center"> Reasoning-Aware Proxy Reward Model using Process Mining </h1>
 
 <p align="center">
 <a href="https://pnubaelab.github.io/"><b>BAELAB</b></a>, Pusan National University, Busan, Korea
@@ -29,7 +29,17 @@ licence: license
 
 # Abstract
 
-
+Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)
+fine-tuning for post-training language models. However, for reasoning tasks such as mathematical problem solving,
+binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted
+to address this issue by estimating **overall** reasoning quality, it remains unclear whether these rewards are
+reliable proxies for the quality of stepwise reasoning. In this study, we treat reasoning as a structured process and
+propose the **TACReward** reward model, which can be seamlessly integrated into sparse reward frameworks without
+additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations
+between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range $[0, 1]$.
+Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward
+frameworks encourages the policy model to improve the structural quality of its reasoning, leading to
+consistent performance improvements over existing sparse reward frameworks.
 
 # Illustration of PM4GRPO
 
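The abstract describes TACReward as aggregating stepwise structural deviations between teacher and policy reasoning into a scalar reward in $[0, 1]$. As a rough illustration only — the model card does not specify the actual process-mining construction, and every name below (`edit_distance`, `structural_reward`, the step labels) is hypothetical — a toy stand-in can score a policy's sequence of reasoning-step labels against a teacher's with a normalized edit distance:

```python
# Toy stand-in, NOT TACReward itself: the card does not give the
# process-mining details, so this sketch compares a policy's sequence of
# reasoning-step labels to a teacher's using a normalized Levenshtein
# distance, yielding a scalar reward in [0, 1] as the abstract describes.

def edit_distance(a, b):
    """Levenshtein distance between two sequences of step labels."""
    dp = list(range(len(b) + 1))          # dp[j] = distance(a[:i], b[:j])
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # delete x
                        dp[j - 1] + 1,    # insert y
                        prev + (x != y))  # match or substitute
            prev = cur
    return dp[-1]

def structural_reward(teacher_steps, policy_steps):
    """Reward in [0, 1]; 1.0 means the step structures match exactly."""
    if not teacher_steps and not policy_steps:
        return 1.0
    dist = edit_distance(teacher_steps, policy_steps)
    return 1.0 - dist / max(len(teacher_steps), len(policy_steps))
```

In a GRPO-style setup such a dense scalar could be combined with the sparse outcome reward; how the step labels are extracted from reasoning traces is left open here.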