---
base_model:
- CodeGoat24/UnifiedReward-7b
datasets:
- CodeGoat24/HPD
- CodeGoat24/OIP
- CodeGoat24/EvalMuse
- CodeGoat24/ShareGPTVideo-DPO
- CodeGoat24/LLaVA-Critic-113k
- CodeGoat24/VideoDPO
- CodeGoat24/Text-2-Video-Human-Preferences
- CodeGoat24/OpenAI-4o_t2i_human_preference
- CodeGoat24/ImageGen_Reward_Cold_Start
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---

## Model Summary

`UnifiedReward-Think-7b` is the first unified multimodal chain-of-thought (CoT) reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.

For further details, please refer to the following resources:
- 📰 Paper: https://arxiv.org/pdf/2505.03318
- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/think
- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)

### Quick Start
All inference code is provided in our [GitHub repository](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think).

The example below uses image understanding assessment:
~~~python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

import sys
import warnings
import os


warnings.filterwarnings("ignore")
pretrained = "CodeGoat24/UnifiedReward-Think-7b"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args

model.eval()

url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
Query = 'What does this image present?'
R1 = 'The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.'
R2 = 'This is a handwritten number seven.'

question = ("<image>\nGiven a question and a reference image, please analyze in detail the two provided answers (Answer 1 and Answer 2). " \
            "Evaluate them based on the following three core dimensions:\n" \
            "1. Semantic accuracy: How well the answer reflects the visual content of the image\n" \
            "2. Correctness: Whether the answer is logically and factually correct\n" \
            "3. Clarity: Whether the answer is clearly and fluently expressed\n" \
            "You may also consider additional dimensions if you find them relevant (e.g., reasoning ability, attention to detail, multimodal grounding, etc.). " \
            "For each dimension, provide a score from 1 to 10 for both answers, and briefly explain your reasoning. " \
            "Then, compute the total score for each answer by explicitly adding the scores for all dimensions and showing the full calculation. " \
            "Enclose your full reasoning within <think> and </think> tags. " \
            "Then, in the <answer> tag, output exactly one of the following: 'Answer 1 is better' or 'Answer 2 is better'. No other text is allowed in the <answer> section.\n\n" \
            "Example format:\n" \
            "<think>\n" \
            "1. Semantic accuracy: Answer 1 (9/10) - ...; Answer 2 (7/10) - ...\n" \
            "2. Correctness: Answer 1 (8/10) - ...; Answer 2 (7/10) - ...\n" \
            "3. Clarity: Answer 1 (9/10) - ...; Answer 2 (8/10) - ...\n" \
            "[Additional dimensions if any]: Answer 1 (6/10) - ...; Answer 2 (7/10) - ...\n" \
            "Total score:\nAnswer 1: 9+8+9+6=32\nAnswer 2: 7+7+8+7=29\n" \
            "</think>\n" \
            "<answer>Answer 1 is better</answer>\n\n" \
            "**Note: In the example above, scores and the final answer are placeholders meant only to demonstrate the format. Your actual evaluation should be based on the quality of two given answers.**\n\n"
            f"Your task is provided as follows:\nQuestion: [{Query}]\nAnswer 1: [{R1}]\nAnswer 2: [{R2}]")
            
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]


cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
~~~
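
The prompt above asks the model to wrap its reasoning in `<think>` tags and its verdict in an `<answer>` tag. The following is a minimal sketch of how a decoded response could be turned into a preference index, assuming the model follows that format. The helper name `parse_reward_output` and the `sample` string are hypothetical: they are not part of the released code, and the sample is hand-written rather than real model output.

```python
import re


def parse_reward_output(text):
    """Split a decoded response into its <think> reasoning and <answer> verdict.

    Hypothetical helper, not part of the UnifiedReward repository.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    verdict = answer.group(1).strip() if answer else None
    # Map the two verdict strings requested by the prompt to an answer index;
    # anything else (including a missing tag) yields None.
    preferred = {"Answer 1 is better": 1, "Answer 2 is better": 2}.get(verdict)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "verdict": verdict,
        "preferred": preferred,
    }


# Hand-written sample in the requested format (not real model output).
sample = (
    "<think>\n"
    "1. Semantic accuracy: Answer 1 (3/10) - ...; Answer 2 (9/10) - ...\n"
    "</think>\n"
    "<answer>Answer 2 is better</answer>"
)
result = parse_reward_output(sample)
print(result["preferred"])
```

In the quick-start script, `text_outputs[0]` could be passed in place of `sample`. Returning `None` for an unrecognized verdict lets the caller decide whether to retry or discard a response that does not follow the format.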


## Citation

```
@article{UnifiedReward-Think,
  title={Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2505.03318},
  year={2025}
}
```