---
base_model:
- CodeGoat24/UnifiedReward-7b
datasets:
- CodeGoat24/HPD
- CodeGoat24/OIP
- CodeGoat24/EvalMuse
- CodeGoat24/ShareGPTVideo-DPO
- CodeGoat24/LLaVA-Critic-113k
- CodeGoat24/VideoDPO
- CodeGoat24/Text-2-Video-Human-Preferences
- CodeGoat24/OpenAI-4o_t2i_human_preference
- CodeGoat24/ImageGen_Reward_Cold_Start
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---
## Model Summary
`UnifiedReward-Think-7b` is the first unified multimodal chain-of-thought (CoT) reward model, capable of multi-dimensional, step-by-step, long-chain reasoning for both visual understanding and generation reward tasks.
For further details, please refer to the following resources:
- 📰 Paper: https://arxiv.org/pdf/2505.03318
- 🪐 Project Page: https://codegoat24.github.io/UnifiedReward/think
- 🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
- 🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
- 👋 Point of Contact: [Yibin Wang](https://codegoat24.github.io)
### Quick Start
All inference code is provided in our [GitHub repository](https://github.com/CodeGoat24/UnifiedReward/tree/main/UnifiedReward-Think).
We take image understanding assessment as an example here:
~~~python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
import os
warnings.filterwarnings("ignore")
pretrained = "CodeGoat24/UnifiedReward-Think-7b"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other arguments you want to pass via llava_model_args
model.eval()
url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
Query = 'What does this image present?'
R1 = 'The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.'
R2 = 'This is a handwritten number seven.'
question = ("<image>
Given a question and a reference image, please analyze in detail the two provided answers (Answer 1 and Answer 2). " \
"Evaluate them based on the following three core dimensions:
" \
"1. Semantic accuracy: How well the answer reflects the visual content of the image
" \
"2. Correctness: Whether the answer is logically and factually correct
" \
"3. Clarity: Whether the answer is clearly and fluently expressed
" \
"You may also consider additional dimensions if you find them relevant (e.g., reasoning ability, attention to detail, multimodal grounding, etc.). " \
"For each dimension, provide a score from 1 to 10 for both answers, and briefly explain your reasoning. " \
"Then, compute the total score for each answer by explicitly adding the scores for all dimensions and showing the full calculation. " \
"Enclose your full reasoning within <think> and </think> tags. " \
"Then, in the <answer> tag, output exactly one of the following: 'Answer 1 is better' or 'Answer 2 is better'. No other text is allowed in the <answer> section.
" \
"Example format:
" \
"<think>
" \
"1. Semantic accuracy: Answer 1 (9/10) - ...; Answer 2 (7/10) - ...
" \
"2. Correctness: Answer 1 (8/10) - ...; Answer 2 (7/10) - ...
" \
"3. Clarity: Answer 1 (9/10) - ...; Answer 2 (8/10) - ...
" \
"[Additional dimensions if any]: Answer 1 (6/10) - ...; Answer 2 (7/10) - ...
" \
"Total score:
Answer 1: 9+8+9+6=32
Answer 2: 7+7+8+7=29
" \
"</think>
" \
"<answer>Answer 1 is better</answer>
" \
"**Note: In the example above, scores and the final answer are placeholders meant only to demonstrate the format. Your actual evaluation should be based on the quality of two given answers.**
"
f"Your task is provided as follows:
Question: [{Query}]
Answer 1: [{R1}]
Answer 2: [{R2}]")
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
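# Greedy decoding (do_sample=False, temperature=0) makes the evaluation deterministic across runs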
cont = model.generate(
input_ids,
images=image_tensor,
image_sizes=image_sizes,
do_sample=False,
temperature=0,
max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
~~~
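The response follows the `<think> ... </think><answer> ... </answer>` format that the prompt requests. Below is a minimal sketch for extracting the reasoning trace and the final verdict from `text_outputs[0]`; the `parse_verdict` helper and its regexes are our own illustration, not part of the official repository:

~~~python
import re

def parse_verdict(output: str):
    """Split a UnifiedReward-Think response into reasoning and verdict.

    The prompt instructs the model to wrap its step-by-step scoring in
    <think>...</think> and to emit exactly 'Answer 1 is better' or
    'Answer 2 is better' inside <answer>...</answer>.
    """
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    verdict = answer.group(1).strip() if answer else None
    return reasoning, verdict

reasoning, verdict = parse_verdict(text_outputs[0])
print(verdict)  # e.g. 'Answer 2 is better'
~~~

Returning `None` when a tag is missing keeps the parser robust if generation is truncated before the closing `</answer>` tag.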
## Citation
```
@article{UnifiedReward-Think,
  title={Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2505.03318},
year={2025}
}
```