---
library_name: transformers
base_model:
- Qwen/Qwen2.5-Omni-7B
---

# Model Card for Flex-Omni-7B

[Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)

**Flex-Omni-7B** is an 11B-parameter multimodal evaluator that handles not only vision-language tasks but also audio-based evaluation, which traditional vision-language (VL) evaluators cannot do. It inherits the reasoning-by-text paradigm from Flex-Judge, enabling strong performance across modalities, and even outperforms models such as Gemini-2.0-Flash on audio benchmarks such as MOS and speech scoring. Unlike VL-only models, Flex-Omni-7B unifies vision, language, and audio reasoning within a single framework.

### Model Description

- We propose **Flex-Judge**, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats.
- Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge evaluation.

### Model Sources

- **Repository:** https://github.com/jongwooko/flex-judge
- **Paper:** [Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)

## Uses

For more comprehensive usage examples and implementation details, please refer to our official repository.

### Requirements

```
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
pip install qwen-omni-utils[decord] -U
pip install vllm
pip install datasets
```

### Using vLLM

We recommend using `vllm` instead of `transformers` to improve inference speed. The results in our paper are based on the `vllm` library.

```
from datasets import load_dataset
from vllm import LLM, SamplingParams

# Default: load the model on the available device(s)
llm = LLM(
    "jongwooko/Flex-Omni-7B",
    tensor_parallel_size=4,
    limit_mm_per_prompt={"image": 1},  # maximum number of images accepted per prompt
)
sampling_params = SamplingParams(
    max_tokens=4096,
    temperature=0.2,
    top_p=0.95,
)

# Example: a pairwise comparison from VL-RewardBench
example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
question, image = example["query"], example["image"]
answer1, answer2 = example["response"]

# System prompt for Flex-Judge
SYSTEM_PROMPT = (
    "You are a helpful assistant. The assistant first performs a detailed, "
    "step-by-step reasoning process in its mind and then provides the user with "
    "the answer. The reasoning process and answer are enclosed within <think> "
    "reasoning process here, explaining each step of your evaluation for both "
    "assistants </think><answer> answer here </answer>. Now the user asks you "
    "to judge the performance of two AI assistants in response to the question. "
    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
    "relevance, accuracy, and level of detail. Avoid order, length, style or "
    "other bias. After thinking, when you finally reach a conclusion, clearly "
    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
    "example, <answer>3</answer><answer>5</answer>"
)

instruction = (
    f"<|vision_start|><|IMAGE|><|vision_end|>\n\n[Question]\n{question}\n\n"
    f"[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"
)
prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{instruction}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n"
)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [image]}}

# Inference: generate the judge's reasoning and scores
outputs = llm.generate([inputs], sampling_params=sampling_params)
output_text = outputs[0].outputs[0].text
print(output_text)
```

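The judge returns its verdict as two scores inside `<answer>` tags, as instructed by the system prompt. As a minimal sketch (not part of the official repository; `parse_scores` is a hypothetical helper), the scores can be pulled out of the `output_text` produced above with a small regex:

```
import re

def parse_scores(text):
    """Extract integer scores from '<answer>3</answer><answer>5</answer>'-style output."""
    scores = [int(s) for s in re.findall(r"<answer>\s*(\d+)\s*</answer>", text)]
    # The system prompt asks for exactly two scores; treat anything else as a format error.
    return scores if len(scores) == 2 else None

scores = parse_scores(output_text)
if scores is None:
    print("Unexpected output format; consider re-sampling.")
else:
    score1, score2 = scores
    verdict = "tie" if score1 == score2 else ("Assistant 1" if score1 > score2 else "Assistant 2")
    print(f"Assistant 1: {score1}, Assistant 2: {score2} -> {verdict}")
```

Returning `None` on a malformed output leaves it to the caller to decide whether to re-sample the judgment or skip the example.
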
## Citation

**BibTeX:**

```
@article{ko2025flex,
  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
  journal={arXiv preprint arXiv:2505.18601},
  year={2025}
}
```