virtuoussy/Qwen2.5-7B-Instruct-RLVR

Improve language tag

by lbourdois - opened Apr 28, 2025

base: refs/heads/main ← from: refs/pr/3

README.md +109 -97

@@ -1,98 +1,110 @@
----
-license: apache-2.0
-datasets:
-- virtuoussy/Math-RLVR
-- virtuoussy/Multi-subject-RLVR
-language:
-- en
-base_model:
-- Qwen/Qwen2.5-7B-Instruct
----
-Model Details
-The generative reward model used in paper "Expanding RL with Verifiable Rewards Across Diverse Domains".
-Inputting the question, label and the response to be evaluated, the model will judge if the response is right.
-## **Quick start**
-> ```python
-> # Load model directly
-> from transformers import AutoTokenizer, AutoModelForCausalLM
->
-> tokenizer = AutoTokenizer.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
-> model = AutoModelForCausalLM.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
->
-> PROMPT= '''
-> Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
-> The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
-> **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**
->
-> Your task:
-> - Compare the final output of the solution process with the reference answer.
-> - If they **match exactly**, output **YES**.
-> - If they **do not match**, output **NO**.
-> - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.
->
-> Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.
->
-> ---
->
-> **Question:**
-> {question}
->
-> **Solution Process (Final Step Only):**
-> {response}
->
-> **Reference Answer:**
-> {reference}
->
-> **Output:**
-> '''
->
->
-> question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (　　)."
-> label="Chen Heqin"
-> answer="heqin chen"
->
-> prompt_question = PROMPT.format(question=question, reference=label, response=answer)
-> messages=[
->            {"role": "system", "content": "You are a helpful assistant."},
->            {"role": "user", "content": prompt_question},
->          ]
-> input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt")
-> output=model.generate(input_ids,do_sample=False)
-> judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True)
-> print("Model judgement: ",judgement)
-> ```
-## Use as a remote reward
-```bash
-# launch a remote reward
-bash launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
-# MODEL_PATH: the path of our generative reward model.
-# ANSWER_PATH: the path of the training data.
-# METRIC: greedy/prob
-# This will launch a reward at http://127.0.0.1:8000/get_reward
-# train
-bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
-# Both train.sh and launch_reward.sh can be found in the model directory.
-# We will release our github repo soon!
-```
-## Citation
-```bibtex
-@article{su2025expanding,
-  title={Expanding RL with Verifiable Rewards Across Diverse Domains},
-  author={Su, Yi and Yu, Dian and Song, Linfeng and Li, Juntao and Mi, Haitao and Tu, Zhaopeng and Zhang, Min and Yu, Dong},
-  journal={arXiv preprint arXiv:2503.23829},
-  year={2025}
-}
 ```

+---
+license: apache-2.0
+datasets:
+- virtuoussy/Math-RLVR
+- virtuoussy/Multi-subject-RLVR
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
+---
+Model Details
+The generative reward model used in paper "Expanding RL with Verifiable Rewards Across Diverse Domains".
+Inputting the question, label and the response to be evaluated, the model will judge if the response is right.
+## **Quick start**
+> ```python
+> # Load model directly
+> from transformers import AutoTokenizer, AutoModelForCausalLM
+>
+> tokenizer = AutoTokenizer.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
+> model = AutoModelForCausalLM.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
+>
+> PROMPT= '''
+> Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
+> The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
+> **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**
+>
+> Your task:
+> - Compare the final output of the solution process with the reference answer.
+> - If they **match exactly**, output **YES**.
+> - If they **do not match**, output **NO**.
+> - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.
+>
+> Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.
+>
+> ---
+>
+> **Question:**
+> {question}
+>
+> **Solution Process (Final Step Only):**
+> {response}
+>
+> **Reference Answer:**
+> {reference}
+>
+> **Output:**
+> '''
+>
+>
+> question="The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (　　)."
+> label="Chen Heqin"
+> answer="heqin chen"
+>
+> prompt_question = PROMPT.format(question=question, reference=label, response=answer)
+> messages=[
+>            {"role": "system", "content": "You are a helpful assistant."},
+>            {"role": "user", "content": prompt_question},
+>          ]
+> input_ids=tokenizer.apply_chat_template(messages,return_tensors="pt")
+> output=model.generate(input_ids,do_sample=False)
+> judgement=tokenizer.decode(output[0][input_ids.shape[1]:],skip_special_tokens=True)
+> print("Model judgement: ",judgement)
+> ```
+## Use as a remote reward
+```bash
+# launch a remote reward
+bash launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
+# MODEL_PATH: the path of our generative reward model.
+# ANSWER_PATH: the path of the training data.
+# METRIC: greedy/prob
+# This will launch a reward at http://127.0.0.1:8000/get_reward
+# train
+bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
+# Both train.sh and launch_reward.sh can be found in the model directory.
+# We will release our github repo soon!
+```
+## Citation
+```bibtex
+@article{su2025expanding,
+  title={Expanding RL with Verifiable Rewards Across Diverse Domains},
+  author={Su, Yi and Yu, Dian and Song, Linfeng and Li, Juntao and Mi, Haitao and Tu, Zhaopeng and Zhang, Min and Yu, Dong},
+  journal={arXiv preprint arXiv:2503.23829},
+  year={2025}
+}
 ```
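
The quick start in the README ends by printing the raw `judgement` string, while the prompt asks the model to answer strictly YES or NO. To use that string as a reward signal it has to be mapped to a number. The following is a minimal sketch of one way to do that, not the authors' implementation: the helper name `judgement_to_reward` and the choice to score anything other than a clean YES as 0.0 are assumptions made for illustration.

```python
import string

# Characters trimmed from both ends of the model output: whitespace plus
# punctuation such as quotes, asterisks, and periods.
_TRIM = string.whitespace + string.punctuation


def judgement_to_reward(judgement: str) -> float:
    """Map a YES/NO judgement string to a binary reward.

    Hypothetical helper (not part of the model repository). Any output that
    is not exactly YES after trimming is treated as incorrect, mirroring the
    prompt's rule that unclear or ambiguous cases count as NO.
    """
    verdict = judgement.strip(_TRIM).upper()
    return 1.0 if verdict == "YES" else 0.0


if __name__ == "__main__":
    for raw in ["YES", " yes\n", "**'NO'**", "YES, because the names match"]:
        print(repr(raw), "->", judgement_to_reward(raw))
```

Scoring extra text after YES as 0.0 is a deliberately strict choice; a more lenient parser could accept a leading YES, at the cost of rewarding outputs that ignore the required format.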