---
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Model Card for Flex-VL-7B


[Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)

**Flex-VL-7B** is a vision-language model developed as part of the Flex-Judge framework, designed to perform robust evaluation of multimodal content using primarily text-only reasoning. Despite being trained with minimal supervision, it generalizes effectively to complex image- and video-based evaluation tasks, enabling consistent and interpretable judgments across diverse multimodal inputs.

### Model Description

- We propose **Flex-Judge**, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats.
- Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable, multimodal model-as-a-judge.

### Model Sources

- **Repository:** https://github.com/jongwooko/flex-judge
- **Paper:** [Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)

## Uses

For more comprehensive usage examples and implementation details, please refer to our official repository.

### Requirements


```bash
pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]==0.0.8
pip install vllm
pip install datasets
```

### Using 🤗 Transformers to Chat


Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from datasets import load_dataset

import torch


# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "jongwooko/Flex-VL-7B", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "jongwooko/Flex-VL-7B",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("jongwooko/Flex-VL-7B")

# Example
example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
question, image = example["query"], example["image"]
answer1, answer2 = example["response"]

# System prompt for Flex-Judge
SYSTEM_PROMPT = (
    "You are a helpful assistant. The assistant first performs a detailed, "
    "step-by-step reasoning process in its mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think> "
    "reasoning process here, explaining each step of your evaluation for both "
    "assistants </think><answer> answer here </answer>. Now the user asks you "
    "to judge the performance of two AI assistants in response to the question. "
    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
    "relevance, accuracy, and level of detail. Avoid order, length, style or "
    "other bias. After thinking, when you finally reach a conclusion, clearly "
    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
    "example, <answer>3</answer><answer>5</answer>"
)

messages = [
    {
        "role": "system", "content": SYSTEM_PROMPT
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {"type": "text", "text": f"[Question]\n{question}\n\n[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"},
        ]
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text+"\n<think>\n\n"],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
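As the system prompt specifies, the judge emits one score per assistant inside `<answer>` tags after its `<think>` reasoning. A minimal sketch of extracting those scores from the decoded output (the `parse_scores` helper is illustrative, not part of the repository):

```python
import re

def parse_scores(judge_output: str) -> list[int]:
    """Extract per-assistant scores from <answer>...</answer> tags."""
    # The prompt asks for e.g. <answer>3</answer><answer>5</answer>
    return [int(s) for s in re.findall(r"<answer>\s*(\d+)\s*</answer>", judge_output)]

example_output = (
    "<think>Assistant 2's answer is more accurate and detailed...</think>"
    "<answer>3</answer><answer>5</answer>"
)
print(parse_scores(example_output))  # [3, 5]
```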

### Using vLLM

We recommend using `vllm` instead of `transformers` for faster inference; the results in our paper were obtained with the `vllm` library.

```python
from transformers import AutoProcessor
from datasets import load_dataset
from vllm import LLM, SamplingParams

# default: Load the model on the available device(s)
llm = LLM(
    "jongwooko/Flex-VL-7B",
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},  # The maximum number to accept
)
sampling_params = SamplingParams(
    max_tokens=4096,
    temperature=0.2,
    top_p=0.95,
)

# default processor
processor = AutoProcessor.from_pretrained("jongwooko/Flex-VL-7B", use_fast=True)

# Example
example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
question, image = example["query"], example["image"]
answer1, answer2 = example["response"]

# System prompt for Flex-Judge
SYSTEM_PROMPT = (
    "You are a helpful assistant. The assistant first performs a detailed, "
    "step-by-step reasoning process in its mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think> "
    "reasoning process here, explaining each step of your evaluation for both "
    "assistants </think><answer> answer here </answer>. Now the user asks you "
    "to judge the performance of two AI assistants in response to the question. "
    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
    "relevance, accuracy, and level of detail. Avoid order, length, style or "
    "other bias. After thinking, when you finally reach a conclusion, clearly "
    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
    "example, <answer>3</answer><answer>5</answer>"
)

messages = [
    {
        "role": "system", "content": SYSTEM_PROMPT
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": f"<|vision_start|><|image_pad|><|vision_end|>\n\n[Question]\n{question}\n\n[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"},
        ]
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = {"prompt": text, "multi_modal_data": {"image": [image]}}

# Inference: Generation of the output
outputs = llm.generate([inputs], sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text
print(output_text)
```

## Citation


**BibTeX:**

```bibtex
@article{ko2025flex,
  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2505.18601},
  year={2025}
}
```