Thrillcrazyer committed · Commit f6594ae · verified · 1 Parent(s): fe0be03

Update README.md

Files changed (1): README.md +13 -3
README.md CHANGED
@@ -2,11 +2,11 @@
 datasets: DeepMath-103k
 library_name: transformers
-model_name: Qwen-7B_THIP
+model_name: TACReward7B
 licence: license
 ---
 
-<h1 align="center"> Reasoning-Aware GRPO using Process Mining </h1>
+<h1 align="center"> Reasoning-Aware Proxy Reward Model using Process Mining </h1>
 
 <p align="center">
 <a href="https://pnubaelab.github.io/"><b>BAELAB</b></a>, Pusan National University, Busan, Korea
@@ -29,7 +29,17 @@ licence: license
 
 # Abstract
 
-Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose **PM4GRPO**, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that **PM4GRPO** significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
+Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)
+fine-tuning for post-training language models. However, for reasoning tasks such as mathematical problem solving,
+binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted
+to address this issue by estimating **overall** reasoning quality, it remains unclear whether these rewards are
+reliable proxies for the quality of stepwise reasoning. In this study, we treat reasoning as a structured process and
+propose the **TACReward** reward model. The model can be seamlessly integrated into sparse reward frameworks without
+additional human annotation cost or architectural modifications. TACReward aggregates stepwise structural deviations
+between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range $[0, 1]$.
+Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward
+frameworks encourages the policy model to improve the structural quality of its reasoning. Consequently, this leads to
+consistent performance improvements over existing sparse reward frameworks.
 
 # Illustration of PM4GRPO
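The new abstract describes a reward that aggregates stepwise structural deviations between a teacher's reasoning trace and the policy model's trace into a scalar in $[0, 1]$, but the README does not show the computation. Below is a minimal illustrative sketch of that idea, where each trace is a sequence of step labels and a normalized edit distance stands in for process-mining conformance checking (in practice, replay-fitness or alignment-based conformance from a library such as pm4py would play this role). The function name `conformance_reward` and the step labels are hypothetical, not from the commit.

```python
# Hypothetical sketch: a [0, 1] "structural conformance" reward between a
# teacher's reasoning trace and a policy model's trace, in the spirit of
# TACReward. Real process-mining conformance checking (token-based replay,
# alignments) would replace the plain edit distance used here.

def edit_distance(a, b):
    """Levenshtein distance between two step-label sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x -> y
        prev = curr
    return prev[-1]

def conformance_reward(teacher_steps, policy_steps):
    """Scalar reward in [0, 1]: 1.0 for identical step structure,
    approaching 0.0 as the two traces diverge."""
    if not teacher_steps and not policy_steps:
        return 1.0
    dist = edit_distance(teacher_steps, policy_steps)
    return 1.0 - dist / max(len(teacher_steps), len(policy_steps))

teacher = ["restate", "define_vars", "set_equation", "solve", "verify"]
policy  = ["restate", "set_equation", "solve", "verify"]
print(conformance_reward(teacher, policy))  # one missing step out of five -> 0.8
```

Such a score can be added to a sparse outcome reward without architectural changes, matching the abstract's claim that the signal plugs into existing sparse reward frameworks.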