Improve model card: Add pipeline tag, paper link, abstract, code, and usage

2e7f0a6 verified 6 months ago

7.64 kB

base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: transformers
license: other
tags:
  - llama-factory
  - full
  - generated_from_trainer
  - vision-language-model
model-index:
  - name: Qwen2.5-VL-3B-Instruct
    results: []
pipeline_tag: image-text-to-text

Qwen2.5-VL-3B-Instruct: Self-Rewarding Vision-Language Model via Reasoning Decomposition

This model is a fine-tuned version of Qwen/Qwen2.5-VL-3B-Instruct on the mllm_data1_cotOnly and the mllm_data1_description_val_text_only datasets. It was presented in the paper Self-Rewarding Vision-Language Model via Reasoning Decomposition.

Code: https://github.com/zli12321/Vision-SR1

Abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervisions via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back the input image. To validate this self-containment, the same VLM model is then re-prompted to perform language reasoning using only the generated perception as input to compute reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.

Model description

Vision-SR1 is a self-rewarded Reinforcement Learning (RL) training framework that decomposes Vision-Language Models' (VLMs) language reasoning into visual perception reasoning and language reasoning. Inspired by works like Vision-R1, Visionary-R1, and R1-VL, Vision-SR1 leverages the VLM's self-evolving and reasoning ability to reward itself.

VLMs often rely primarily on language reasoning rather than visual perception because they fuse the vision encoder with the LLM backbone late in pretraining. Standard RL training can lead to recalling prior language knowledge for accuracy gains while neglecting vision. External LLM-based perception rewards can help but introduce bias and heavy latency. Vision-SR1 proposes a self-reward framework, enabling the model to provide its own visual and reasoning feedback with no latency, thereby strengthening both visual perception and language reasoning, mitigating visual hallucinations, and reducing reliance on language shortcuts.

Intended uses & limitations

This model is intended for research in Vision-Language Models, particularly for tasks benefiting from improved visual reasoning, mitigation of visual hallucinations, and reduced reliance on language shortcuts.

Limitations:

LLM evaluation scripts and model generation outputs with LLM judgments are currently in progress.

Training and evaluation data

The training dataset used for Vision-SR1 is sourced from 23 sources and evenly split across three main areas: general visual understanding, science knowledge, and multimodal mathematical reasoning.

Specific datasets constructed for Vision-SR1 training include:

📊 Vision-SR1-Cold-Start-9K (for Supervised Fine-Tuning, SFT)
📊 Vision-SR1-47K (for Reinforcement Learning, RL)

Sample Usage

The following snippets are adopted directly from the Vision-SR1 GitHub repository to demonstrate setup and training procedures.

Requirements

git clone https://github.com/zli12321/Vision-SR1.git
cd Vision-SR1
conda create -n Vision-SR1 python=3.11
bash setup.sh

GRPO Training

### Self-Reward Vision-SR1 GRPO Training
bash ./train_examples/2-7b_selfReward_train.sh

### Vision-SR1 regular training
bash ./train_examples/1-7b_visionR1_train.sh

Merge checkpoints

python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor

Generating Evaluation Responses

bash ./validation_examples/2-seethink_format_eval.sh

Supervised Finetuning Setup

The supervised finetuning code is adopted from LLaMA-Factory for easy setup.

conda create -n SFT python=3.11
cd LLaMA-Factory-Cold-Start
pip install -e ".[torch,metrics]" --no-build-isolation

pip install --upgrade huggingface_hub
huggingface-cli login

Supervised Finetuning Training

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/Vision-SR1-Cold-Start.yaml

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 2
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3.0

Training results

The training concluded with the following overall results:

epoch: 2.957983193277311
total_flos: 92447203917824.0
train_loss: 0.6085002004763501
train_runtime: 1135.371
train_samples_per_second: 20.124
train_steps_per_second: 0.156

Reward progression during training:

Framework versions

Transformers 4.49.0
Pytorch 2.7.1+cu126
Datasets 3.6.0
Tokenizers 0.21.1

Citation

If you use this model or find our works helpful, please cite the original paper: Self-Rewarding Vision-Language Model via Reasoning Decomposition.

We also recommend to cite the source code work EasyR1:

@misc{zheng2025easyr1,
  title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
  author       = {Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong},
  howpublished = {\url{https://github.com/hiyouga/EasyR1}},
  year         = {2025}
}