Update README.md
README.md
CHANGED
@@ -2,11 +2,11 @@
 
 datasets: DeepMath-103k
 library_name: transformers
-model_name:
+model_name: TACReward7B
 licence: license
 ---
 
-<h1 align= "center"> Reasoning-Aware
+<h1 align="center"> Reasoning-Aware Proxy Reward Model using Process Mining </h1>
 
 <p align="center">
 <a href="https://pnubaelab.github.io/"><b>BAELAB</b></a>, Pusan National University, Busan, Korea
@@ -29,7 +29,17 @@ licence: license
 
 # Abstract
 
-
+Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)
+fine-tuning for post-training language models. However, for reasoning tasks such as mathematical problem solving,
+binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted
+to address this issue by estimating **overall** reasoning quality, it remains unclear whether these rewards are
+reliable proxies for the quality of stepwise reasoning. In this study, we treat reasoning as a structured process and
+propose the **TACReward** reward model, which can be seamlessly integrated into sparse reward frameworks without
+additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations
+between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range $[0, 1]$.
+Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward
+frameworks encourages the policy model to improve the structural quality of its reasoning, leading to
+consistent performance improvements over existing sparse reward frameworks.
 
 # Illustration of PM4GRPO
 
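The abstract describes TACReward as aggregating stepwise structural deviations between teacher and policy reasoning into a scalar reward in $[0, 1]$. As a rough illustration only — the model card does not specify the actual process-mining construction, and every name below (`edit_distance`, `structural_reward`, the step labels) is hypothetical — a toy stand-in can score a policy's sequence of reasoning-step labels against a teacher's with a normalized edit distance:

```python
# Toy stand-in, NOT TACReward itself: the card does not give the
# process-mining details, so this sketch compares a policy's sequence of
# reasoning-step labels to a teacher's using a normalized Levenshtein
# distance, yielding a scalar reward in [0, 1] as the abstract describes.

def edit_distance(a, b):
    """Levenshtein distance between two sequences of step labels."""
    dp = list(range(len(b) + 1))          # dp[j] = distance(a[:i], b[:j])
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # delete x
                        dp[j - 1] + 1,    # insert y
                        prev + (x != y))  # match or substitute
            prev = cur
    return dp[-1]

def structural_reward(teacher_steps, policy_steps):
    """Reward in [0, 1]; 1.0 means the step structures match exactly."""
    if not teacher_steps and not policy_steps:
        return 1.0
    dist = edit_distance(teacher_steps, policy_steps)
    return 1.0 - dist / max(len(teacher_steps), len(policy_steps))
```

In a GRPO-style setup such a dense scalar could be combined with the sparse outcome reward; how the step labels are extracted from reasoning traces is left open here.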