---
license: mit
datasets:
- CodeGoat24/HPD
- CodeGoat24/LiFT-HRA
- CodeGoat24/OIP
- CodeGoat24/EvalMuse
- CodeGoat24/ShareGPTVideo-DPO
- CodeGoat24/VideoFeedback
- CodeGoat24/LLaVA-Critic-113k
- CodeGoat24/VideoDPO
base_model:
- lmms-lab/llava-onevision-qwen2-7b-ov
---


# UnifiedReward-7B-v1.5

## Model Summary

`UnifiedReward-7B-v1.5` is an enhanced version of [UnifiedReward-7B](https://huggingface.co/CodeGoat24/UnifiedReward-7b/blob/main/README.md), built on [LLaVA-OneVision-7B](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov). It is the first unified reward model for multimodal understanding and generation assessment, supporting both pairwise ranking and pointwise scoring, and can be employed for vision model preference alignment.

For further details, please refer to the following resources:
- 📰 Paper: https://arxiv.org/pdf/2503.05236
- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/
- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)

## 🔥 News
[2025/10/23] 🔥🔥🔥 We release **UnifiedReward-Edit**-[[3b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-3b)/[7b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-7b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-32b)/[72b](https://huggingface.co/CodeGoat24/UnifiedReward-Edit-qwen-72b)], a unified reward model for **both Text-to-Image and Image-to-Image generation**, trained on approximately 700K unified image generation and editing reward examples!
For the image editing reward task, our models support:

>1. Pairwise Rank: directly judge which of two edited images is better.
>
>2. Pairwise Score: assign a separate score to each image in a pair.
>
>3. Pointwise Score: rate a single image on two axes, instruction-following and overall image quality.

🚀 The image editing reward inference code is available in the [`UnifiedReward-Edit/`](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Edit) directory, while the T2I inference code is unchanged from previous models. The editing training data is preprocessed from [EditScore](https://huggingface.co/datasets/EditScore/EditScore-Reward-Data) and [EditReward](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) and will be released soon. We sincerely appreciate all contributors!
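For a rough sense of how a pairwise edit-ranking call might look, here is a minimal sketch assuming the checkpoints follow the standard Qwen2.5-VL chat interface in `transformers`. The prompt wording, image paths, and editing instruction are illustrative assumptions only; the authoritative templates are in the [`UnifiedReward-Edit/`](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Edit) scripts.

```python
# Minimal pairwise edit-ranking sketch (assumes a Qwen2.5-VL-style checkpoint).
# The prompt template below is illustrative, not the official one.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "CodeGoat24/UnifiedReward-Edit-qwen-7b", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("CodeGoat24/UnifiedReward-Edit-qwen-7b")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "edited_a.png"},  # placeholder: first edited image
        {"type": "image", "image": "edited_b.png"},  # placeholder: second edited image
        {"type": "text", "text": (
            "Given the editing instruction [make the sky a sunset orange], "
            "judge which of the two edited images follows it better and explain why."
        )},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```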

[2025/9/25] 🔥🔥🔥 We release **UnifiedReward-2.0**-qwen-[[3b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-3b)/[7b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-7b)/[32b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-32b)/[72b](https://huggingface.co/CodeGoat24/UnifiedReward-2.0-qwen-72b)].
This version introduces several new capabilities:
>
>1. **Pairwise scoring** for image and video generation assessment on **_Alignment_**, **_Coherence_**, **_Style_** dimensions.
>
>2. **Pointwise scoring** for image and video generation assessment on **_Alignment_**, **_Coherence/Physics_**, **_Style_** dimensions. 
>
The added inference code is available in the [`inference_qwen/UnifiedReward-2.0-inference`](https://github.com/CodeGoat24/UnifiedReward/tree/main/inference_qwen/UnifiedReward-2.0-inference) directory. The newly added training data has been released [here](https://huggingface.co/datasets/CodeGoat24/UnifiedReward-2.0-T2X-score-data) 😊.
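Reusing the Qwen-style pipeline sketched above with a UnifiedReward-2.0 checkpoint, a pointwise request would only swap the message payload. The wording below is a hypothetical illustration, not the released template.

```python
# Illustrative pointwise-scoring payload for UnifiedReward-2.0; pair it with
# the model/processor setup from the previous sketch. Prompt wording is assumed.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "generated.png"},  # placeholder: generated image
        {"type": "text", "text": (
            "You are given a generated image and its prompt: [A cat surfing a wave]. "
            "Score the image on Alignment, Coherence/Physics, and Style, "
            "and briefly justify each score."
        )},
    ],
}]
```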


[2025/4/16] 🔥🔥 We released `UnifiedReward-7B-v1.5`, which introduces pointwise scoring for generated images across three dimensions: alignment, coherence, and style, each rated on a continuous scale from 1 to 5.

1. **Alignment** quantifies how well an image matches its prompt.

2. **Coherence** assesses the logical consistency of the image and the absence of artifacts or visual glitches.

3. **Style** reflects the visual appeal of the image, independent of the prompt.

You are welcome to download the latest version!
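As a rough illustration, a v1.5 pointwise request can reuse the LLaVA-OneVision pipeline from the Quick Start section below, swapping in a prompt along these lines. The wording and example caption here are hypothetical; the exact templates ship with the inference scripts on GitHub.

```python
# Hypothetical v1.5 pointwise T2I scoring prompt; substitute it for
# `critic_prompt` in the Quick Start code below.
critic_prompt = (
    "You are given a generated image and its text prompt. Please rate the image "
    "on alignment, coherence, and style, each on a continuous scale from 1 to 5, "
    "and explain your reasoning with specific details.\n"
    "Prompt: [A red bicycle leaning against a brick wall at dusk]\n"
    "ASSISTANT:\n"
)
```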

## 🏁 Compared with Current Reward Models

| Reward Model | Method | Image Generation | Image Understanding | Video Generation | Video Understanding |
| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| [PickScore](https://github.com/yuvalkirstain/PickScore) | Point | √ | | | |
| [HPS](https://github.com/tgxs002/HPSv2) | Point | √ | | | |
| [ImageReward](https://github.com/THUDM/ImageReward) | Point | √ | | | |
| [LLaVA-Critic](https://huggingface.co/lmms-lab/llava-critic-7b) | Pair/Point | | √ | | |
| [IXC-2.5-Reward](https://github.com/InternLM/InternLM-XComposer) | Pair/Point | | √ | | √ |
| [VideoScore](https://github.com/TIGER-AI-Lab/VideoScore) | Point | | | √ | |
| [LiFT](https://github.com/CodeGoat24/LiFT) | Point | | | √ | |
| [VisionReward](https://github.com/THUDM/VisionReward) | Point | √ | | √ | |
| [VideoReward](https://github.com/KwaiVGI/VideoAlign) | Point | | | √ | |
| UnifiedReward (Ours) | Pair/Point | √ | √ | √ | √ |


## Quick Start
All pairwise-ranking and pointwise-scoring inference code is provided in our [GitHub repository](https://github.com/CodeGoat24/UnifiedReward).

Here we take image understanding assessment as an example:
~~~python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import requests
import copy
import torch

import warnings


warnings.filterwarnings("ignore")
pretrained = "CodeGoat24/UnifiedReward-7b-v1.5"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # pass any additional llava_model_args here if needed

model.eval()

url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models

# pairwise ranking
critic_prompt = "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\nThe second response: [This is a handwritten number seven.]\nASSISTANT:\n"

# pointwise scoring
# critic_prompt = "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answer provided by a Large Multimodal Model (LMM). Score the response out of 100 and explain your reasoning with specific details. Your task is provided as follows:\nQuestion: [What this image presents?]\nThe LMM response: [This is a handwritten number seven.]\nASSISTANT:\n"

question = DEFAULT_IMAGE_TOKEN + "\n" + critic_prompt
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]


cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
~~~
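If you need a numeric value rather than the free-form critique, a small post-processing helper can pull the first score out of the generated text. This is a convenience sketch, not part of the official API, and the model's output format may vary.

```python
import re

def extract_score(critique: str):
    """Return the first number found in the critique, or None if absent."""
    match = re.search(r"\d{1,3}(?:\.\d+)?", critique)
    return float(match.group()) if match else None

# e.g., with the pointwise prompt enabled above:
print(extract_score(text_outputs[0]))
```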


## Citation

```
@article{unifiedreward,
  title={Unified reward model for multimodal understanding and generation},
  author={Wang, Yibin and Zang, Yuhang and Li, Hao and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2503.05236},
  year={2025}
}
```