Model Summary

UnifiedReward-Think-qwen35-4b is the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.

For further details, please refer to the following resources:

📰 Paper: https://arxiv.org/pdf/2505.03318
🪐 Project Page: https://codegoat24.github.io/UnifiedReward/think
🤗 Model Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
🤗 Dataset Collections: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
👋 Point of Contact: Yibin Wang

vLLM Server Deployment

export VLLM_DISABLE_FLASHINFER_GDN_PREFILL=1
export TOKENIZERS_PARALLELISM=false
vllm serve CodeGoat24/UnifiedReward-Think-qwen35-4b \
 --host localhost \
 --port 8080 \
 --trust-remote-code \
 --served-model-name UnifiedReward \
 --gpu-memory-utilization 0.95 \
 --mm-encoder-tp-mode data \
 --mm-processor-cache-type shm \
 --enable-prefix-caching \
 --tensor-parallel-size 8 \
 --default-chat-template-kwargs '{"enable_thinking": false}'

The inference code is provided here.

Citation

@article{unifiedreward-think,
  title={Unified multimodal chain-of-thought reward model through reinforcement fine-tuning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2505.03318},
  year={2025}
}