---
pipeline_tag: text-classification
---

# Released TraceLift Reason RMs

This directory contains two ready-to-load, fully merged Reason Reward Model (Reason RM) checkpoints introduced in the paper [Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards](https://huggingface.co/papers/2605.03862):

- `code-rm-full-ce`: code-domain Reason RM.
- `math-rm-full-ce`: math-domain Reason RM.

TraceLift is a planner-executor training framework that treats reasoning as a consumable intermediate artifact, using executor-grounded rewards to shape reasoning traces.

- **Code:** [GitHub Repository](https://github.com/MasaiahHan/TraceLift)
- **Paper:** [arXiv:2605.03862](https://huggingface.co/papers/2605.03862)

## Training details

Both checkpoints were initialized from `Qwen/Qwen2.5-7B-Instruct`, trained with LoRA, and then merged into full `Qwen2ForReasonRewardModel` weights.

- LoRA rank `32`, alpha `64`, dropout `0.05`.
- Five rubric classification heads, each trained with a cross-entropy (CE) loss on its rubric dimension.
- One total-score head with Huber loss on the normalized total score.
- The released checkpoints already include the backbone, rubric heads, and total head.
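
The objective described by the list above can be sketched in plain Python as a rough illustration. This is not the repository's training code: the number of score levels per rubric, the Huber threshold `delta`, the averaging over the five heads, and the `total_weight` factor are all assumptions made for the example, since the card does not state them.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond `delta`."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def reason_rm_loss(rubric_logits, rubric_labels, total_pred, total_target,
                   total_weight=1.0):
    """Combine per-rubric CE losses with a Huber loss on the total score.

    rubric_logits: one list of class logits per rubric head (five heads here).
    rubric_labels: the gold class index for each rubric head.
    total_pred / total_target: predicted and gold normalized total score.
    total_weight: hypothetical weighting factor, not taken from the card.
    """
    ce_terms = [cross_entropy(l, y) for l, y in zip(rubric_logits, rubric_labels)]
    ce = sum(ce_terms) / len(ce_terms)  # assumption: mean over the heads
    return ce + total_weight * huber(total_pred, total_target)


# Five heads with four hypothetical score levels each, all logits equal.
example = reason_rm_loss([[0.0] * 4] * 5, [0, 1, 2, 3, 0], 0.5, 0.5)
```

With uniform logits each head contributes `log(4)` and a perfect total-score prediction contributes nothing, so `example` equals `math.log(4)`.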

## Usage

To use these models, you need the custom `reasonrm` package from the [official repository](https://github.com/MasaiahHan/TraceLift).

```python
import torch
from transformers import AutoTokenizer

from reasonrm.modeling_reward import Qwen2ForReasonRewardModel

model = Qwen2ForReasonRewardModel.from_pretrained(
    "ScottHan/TraceLift",  # or the path to a local sub-directory such as math-rm-full-ce
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ScottHan/TraceLift")
```
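
Because the two checkpoints live in separate sub-directories, a small helper can make the choice explicit when loading from a local copy of the repository. The sketch below is only illustrative: `checkpoint_subdir` and `local_checkpoint_path` are hypothetical names, and it assumes the repository has been downloaded so that the two sub-directories sit directly under `repo_dir`.

```python
from pathlib import Path

# Sub-directory names as listed at the top of this README.
CHECKPOINT_SUBDIRS = {
    "code": "code-rm-full-ce",
    "math": "math-rm-full-ce",
}


def checkpoint_subdir(domain: str) -> str:
    """Map a domain name ("code" or "math") to its checkpoint sub-directory."""
    key = domain.strip().lower()
    if key not in CHECKPOINT_SUBDIRS:
        raise ValueError(
            f"unknown domain {domain!r}; expected one of {sorted(CHECKPOINT_SUBDIRS)}"
        )
    return CHECKPOINT_SUBDIRS[key]


def local_checkpoint_path(repo_dir: str, domain: str) -> Path:
    """Build the path to one checkpoint inside a local copy of the repository."""
    return Path(repo_dir) / checkpoint_subdir(domain)
```

The resulting path, for example `str(local_checkpoint_path("TraceLift", "math"))`, can then be passed as the first argument to `from_pretrained` in place of the repository id.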

## Citation

```bibtex
@misc{han2026correctisnotenough,
  title={Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards},
  author={Han, Tianyang and Shi, Hengyu and Hu, Junjie and Yang, Xu and Wang, Zhiling and Su, Junhao},
  year={2026},
  eprint={2605.03862},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.03862}
}
```