---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

This is the Thinking Reward Model of SophiaVL-R1 (https://arxiv.org/abs/2505.17018).

The code for SophiaVL-R1 can be found at https://github.com/kxfan2002/SophiaVL-R1.

This model is fine-tuned on the [SophiaVL-R1-Thinking-156k Dataset](https://huggingface.co/datasets/bunny127/SophiaVL-R1-Thinking-156k). The base model is Qwen2.5-VL-3B.

The Thinking Reward Model takes a question together with a model response as input, and outputs a score between 0 and 1 that reflects the thinking quality of the response.
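Because the model returns the score as free-form text, a parsed value can occasionally fall outside the expected range or granularity. A minimal post-processing helper (our own hypothetical addition, not part of the SophiaVL-R1 codebase) that clamps a score into [0, 1] and snaps it to 0.1 steps might look like:

```python
def quantize_reward(score: float) -> float:
    """Clamp a raw score into [0, 1] and round it to the 0.1 granularity
    that the thinking-reward prompt asks for.

    Hypothetical helper, not part of the official SophiaVL-R1 code.
    """
    clamped = min(max(score, 0.0), 1.0)
    return round(clamped, 1)
```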

We provide a command to deploy the Thinking Reward Model using vLLM:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --port 80 \
    --model /path/to/thinking/reward/model \
    --served-model-name thinking-reward-model \
    --tensor-parallel-size 2 \
    --max-num-seqs 64 \
    --max-model-len 32768
```

We provide a script to query the deployed model for a thinking reward:

```python
import base64
import time

import httpx

# Point this at the chat-completions endpoint of the vLLM server,
# e.g. "http://localhost:80/v1/chat/completions" for the command above.
openai_api_base = "vllm-url"
reward_model = "thinking-reward-model"
question = "your question"
image = "your image path"
answer = "your model response"


def encode_image_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def get_process_reward(prompt_str, reasoning_str, image_path=None):
    image_base64 = None
    if image_path is not None:
        image_base64 = encode_image_base64(image_path)
        if "<image>" not in prompt_str:
            prompt_str = f"<image> {prompt_str}"

    prompt = f"""You are an expert reasoning evaluator. I will give you a multimodal question and an answer. Your goal is to judge a reward process and give a score between 0 and 1. You should focus on whether the reasoning process is good rather than whether the final answer is correct.
### Evaluation Criteria:
- **Logical Soundness**: Does each step follow logically from the previous one?
- **Correct Reasoning**: Are the methods and steps used appropriate and valid? Are the facts and lemmas correctly stated and applied?
- **Error Identification**: Are there any logical fallacies, unsupported assumptions, or incorrect steps?
- **Language Consistency**: Is the reasoning process conducted in a single, consistent language without mixing different languages?
- **Redundancy**: Is the reasoning concise, without unnecessary repetition or extraneous steps?
Provide a single score from **{{0, 0.1, 0.2, ..., 1.0}}** based on the reasoning quality, where:
- **0**: Completely flawed reasoning
- **1**: Perfectly sound reasoning
- Intermediate values (e.g., 0.3, 0.7) should reflect partial correctness or minor errors.
Be strict, reward the good process and punish the bad one. You should only output the score without any explanation.
Question: {prompt_str}
Reasoning process: {reasoning_str}
"""

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [{"type": "text", "text": prompt}]},
    ]

    if image_base64 is not None:
        messages[1]["content"].append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_base64}"},
        })

    payload = {
        "model": reward_model,
        "messages": messages,
        "temperature": 0.0,
    }

    attempt = 0
    max_retry = 10
    while attempt < max_retry:
        try:
            response = httpx.post(
                openai_api_base,
                headers={"Content-Type": "application/json"},
                json=payload,
                timeout=60,
            )
            response.raise_for_status()
            result = response.json()["choices"][0]["message"]["content"]
            # The model replies with a bare score such as "0.7".
            return float(result.strip())
        except Exception as e:
            print(f"[Attempt {attempt+1}] get_process_reward failed: {e}, message: {prompt_str}")
            attempt += 1
            time.sleep(1)
    # Fall back to a zero reward after exhausting all retries.
    return 0.0


print(get_process_reward(question, answer, image))
```
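
During RL training, SophiaVL-R1 combines this thinking reward with a rule-based outcome reward. As a rough illustration only (the paper's trust-weighted scheme is more involved, and `combined_reward` with its default weight of 0.5 is our own hypothetical sketch), a simple weighted sum looks like:

```python
def combined_reward(outcome_reward: float, thinking_reward: float,
                    weight: float = 0.5) -> float:
    """Hypothetical sketch: add a down-weighted thinking reward to the
    rule-based outcome reward. Not the exact weighting from the paper."""
    return outcome_reward + weight * thinking_reward
```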