UnifiedReward 2.0 Qwen3.5 Models
Collection
5 items โข Updated
UnifiedReward-Think-qwen35-4b is the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.
For further details, please refer to the following resources:
export VLLM_DISABLE_FLASHINFER_GDN_PREFILL=1
export TOKENIZERS_PARALLELISM=false
vllm serve CodeGoat24/UnifiedReward-Think-qwen35-4b \
--host localhost \
--port 8080 \
--trust-remote-code \
--served-model-name UnifiedReward \
--gpu-memory-utilization 0.95 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--enable-prefix-caching \
--tensor-parallel-size 8 \
--default-chat-template-kwargs '{"enable_thinking": false}'
The inference code is provided here.
@article{unifiedreward-think,
title={Unified multimodal chain-of-thought reward model through reinforcement fine-tuning},
author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2505.03318},
year={2025}
}
Base model
Qwen/Qwen3.5-4B-Base