Upload README.md with huggingface_hub
README.md
CHANGED
@@ -11,7 +11,7 @@ tags:
 
 # Binary-Think-RM-3B
 
-Binary-Think-RM-3B is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://
+Binary-Think-RM-3B is a generative reward model with long-horizon reasoning capabilities, introduced in the paper [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://openreview.net/pdf?id=UfQAFbP6xq).
 
 This model is fine-tuned from [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) using a two-stage training process: (1) reasoning-oriented supervised fine-tuning (SFT) using [ilgee/hs2-naive-reasoning-binary-max](https://huggingface.co/datasets/ilgee/hs2-naive-reasoning-binary-max) and (2) reinforcement learning with verifiable rewards (RLVR) using a prompt part of [ilgee/hs2-naive-reasoning-binary-max](https://huggingface.co/datasets/ilgee/hs2-naive-reasoning-binary-max).
 
@@ -76,7 +76,7 @@ message = tokenizer.apply_chat_template(
 
 ## Performance
 
-For detailed performance metrics on RewardBench, RM-Bench, HelpSteer2-Preference, and HelpSteer3-Preference, please refer to Tables 5, 6, and 7 in the paper: [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://
+For detailed performance metrics on RewardBench, RM-Bench, HelpSteer2-Preference, and HelpSteer3-Preference, please refer to Tables 5, 6, and 7 in the paper: [Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models](https://openreview.net/pdf?id=UfQAFbP6xq)
 
 ## Citation
 