---
datasets: DeepMath-103k
library_name: transformers
model_name: TACReward7B
licence: license
---

<h1 align="center"> Reasoning-Aware Proxy Reward Model using Process Mining </h1>

<p align="center">
  <a href="https://pnubaelab.github.io/"><b>BAELAB</b></a>, Pusan National University, Busan, Korea
</p>

<p align="center">
  <a href="https://yongzzai.com/">Yongjae Lee</a><sup>*</sup>, <a href="https://thrillcrazyer.github.io/">Taekyhun Park</a><sup>*</sup>, Hyerim Bae<sup>†</sup>
</p>

<p align="center">
  <a href="https://github.com/Thrillcrazyer/TACReward"><b>🌟 GitHub</b></a> |
  <a href="https://huggingface.co/Thrillcrazyer/Qwen-1.5B_THIP"><b>📥 1.5B Download</b></a> |
  <a href="https://huggingface.co/Thrillcrazyer/TACReward7B"><b>📥 7B Download</b></a> |
  <a href="https://arxiv.org/abs/2510.25065"><b>📄 arXiv Paper Link</b></a>
</p>

# Abstract

Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)
fine-tuning for post-training language models. However, for reasoning tasks such as mathematical problem solving,
binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted
to address this issue by estimating **overall** reasoning quality, it remains unclear whether these rewards are
reliable proxies for the quality of stepwise reasoning. In this study, we consider reasoning as a structured process and
propose the **TACReward** reward model. The model can be seamlessly integrated into sparse reward frameworks without
additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations
between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range $[0, 1]$.
Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward
frameworks encourages the policy model to improve the structural quality of its reasoning, which leads to
consistent performance improvements over existing sparse reward frameworks.

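
To give a rough intuition for a structural reward in $[0, 1]$, the following is a toy sketch only and not the TACReward algorithm from the paper. It assumes each reasoning trace has already been reduced to a list of step labels, and it swaps in a simple process mining notion, the directly-follows relation, compared with a Jaccard overlap. The function names and step labels are hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

Trace = List[str]  # a reasoning trace as an ordered list of step labels (assumed format)


def directly_follows(trace: Trace) -> Set[Tuple[str, str]]:
    """Return the set of (a, b) pairs where step b immediately follows step a."""
    return set(zip(trace, trace[1:]))


def structural_reward(teacher: Trace, policy: Trace) -> float:
    """Toy structural score in [0, 1]: Jaccard overlap of directly-follows pairs.

    This is an illustrative stand-in, not the authors' method.
    """
    teacher_pairs = directly_follows(teacher)
    policy_pairs = directly_follows(policy)
    union = teacher_pairs | policy_pairs
    if not union:
        # Neither trace has a transition; score 1.0 only if the traces are identical.
        return 1.0 if teacher == policy else 0.0
    return len(teacher_pairs & policy_pairs) / len(union)


# Hypothetical step labels for one math problem.
teacher = ["read_problem", "set_up_equation", "solve", "verify", "answer"]
policy = ["read_problem", "set_up_equation", "solve", "answer"]

print(structural_reward(teacher, policy))  # 2 shared pairs out of 5 distinct pairs -> 0.4
```

In this toy setup the policy trace skips the verification step, so it shares only two of the five distinct transitions with the teacher trace and receives a partial score, whereas a binary outcome reward would not distinguish it from a trace that verifies its answer.
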
# Illustration of TACReward

<div align="center">
  <img src="https://arxiv.org/html/2510.25065v1/x1.png" width="600"/>
</div>