Improve model card for Vision-SR1-7B with metadata and links
This PR significantly improves the model card for the `Osilly/Vision-R1-7B` model, which corresponds to the **Vision-SR1-7B** model introduced in the paper "Self-Rewarding Vision-Language Model via Reasoning Decomposition".
The changes include:
* **Adding `library_name: transformers`** to the metadata. This enables the automated "Use in Transformers" code snippet, consistent with the `Qwen2_5_VLForConditionalGeneration` architecture declared in `config.json` and the `transformers` dependency mentioned in the GitHub README.
* **Adding `pipeline_tag: image-text-to-text`** to the metadata, which ensures the model is discoverable under the appropriate task on the Hugging Face Hub (e.g., https://huggingface.co/models?pipeline_tag=image-text-to-text). This aligns with the model's functionality as a Vision-Language Model.
* **Integrating the paper's abstract** and key information from the GitHub README to provide a comprehensive "About Vision-SR1" section.
* **Including direct links** to the official paper (https://huggingface.co/papers/2508.19652) and the GitHub repository (https://github.com/zli12321/Vision-SR1).
* **Embedding relevant figures** from the GitHub repository to visually explain the method and dataset.
* **Adding links to associated models and datasets** on the Hugging Face Hub, as found in the GitHub README.
* **Providing installation and training setup instructions** based on the GitHub repository.
* **Noting the incorrect BibTeX citation** for the Vision-SR1 paper in the original GitHub README and including the citation for the EasyR1 codebase.
Please review these updates.
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Vision-SR1: Self-Rewarding Vision-Language Model via Reasoning Decomposition

This repository hosts the **Vision-SR1-7B** model, a self-rewarding Vision-Language Model (VLM) for improved visual reasoning, as presented in the paper [Self-Rewarding Vision-Language Model via Reasoning Decomposition](https://huggingface.co/papers/2508.19652).

[Paper](https://huggingface.co/papers/2508.19652) | [Code](https://github.com/zli12321/Vision-SR1)

## About Vision-SR1

Vision-SR1 is a self-rewarding method that enhances visual reasoning in Vision-Language Models (VLMs) through reinforcement learning, without relying on external visual supervision. Trained on sparse visual signals, traditional VLMs tend to prioritize language priors over visual perception, which leads to visual hallucinations and language shortcuts.

Vision-SR1 addresses these limitations by decomposing VLM reasoning into two distinct stages: visual perception and language reasoning. The model is first prompted to generate a self-contained visual perception that is sufficient to answer the given question without referring back to the input image. The same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and the outcome of this second pass is used to compute a self-reward. This self-reward, combined with supervision on the final outputs, provides a balanced training signal that strengthens both visual perception and language reasoning. Experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across a range of vision-language tasks.

The Vision-SR1 framework is illustrated below:

<p align="center">
  <img src="https://github.com/zli12321/Vision-SR1/raw/main/assets/method.png" width="80%">
</p>
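
For intuition, here is a minimal sketch of the decomposed rollout described above, written as plain Python. It is illustrative only: `vlm_generate` and `answer_reward` are hypothetical stand-ins for the policy model and the answer checker, and the prompts and mixing weight are placeholders, not the exact ones used in the paper or the training code.

```python
from typing import Callable

def vision_sr1_reward(
    vlm_generate: Callable[..., str],       # stand-in: (prompt, image=None) -> generated text
    answer_reward: Callable[[str], float],  # stand-in: scores a final answer (e.g. exact match)
    image,
    question: str,
    alpha: float = 0.5,                     # placeholder mixing weight
) -> float:
    """Illustrative sketch of Vision-SR1's decomposed training signal."""
    # Stage 1: visual perception -- given the image, the policy writes a self-contained
    # description that should be sufficient to answer the question, then answers it.
    perception = vlm_generate(
        f"Question: {question}\nDescribe every visual detail needed to answer it.",
        image=image,
    )
    final_answer = vlm_generate(
        f"Question: {question}\nPerception: {perception}\nAnswer:",
        image=image,
    )

    # Stage 2: self-reward -- the SAME model is re-prompted WITHOUT the image, using only
    # its own perception. Answering correctly means the perception was self-contained.
    answer_from_perception_only = vlm_generate(
        f"Question: {question}\nPerception: {perception}\nAnswer:"
    )
    self_reward = answer_reward(answer_from_perception_only)

    # Balanced signal: self-reward on the perception plus supervision on the final output.
    return alpha * self_reward + (1.0 - alpha) * answer_reward(final_answer)
```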

### Datasets

The training dataset for Vision-SR1 is compiled from 23 diverse sources, evenly distributed across three primary areas: general visual understanding, science knowledge, and multimodal mathematical reasoning.

<p align="center">
  <img src="https://github.com/zli12321/Vision-SR1/raw/main/assets/data.png" width="80%">
</p>

Related models and datasets can be found on the Hugging Face Hub:

* **Models:**
  * [Vision-SR1-7B](https://huggingface.co/LMMs-Lab-Turtle/SelfRewarded-R1-7B)
  * [Vision-SR1-7B-Cold-Start](https://huggingface.co/LMMs-Lab-Turtle/Qwen-2.5VL-7B-Cold-Start)
  * [Vision-SR1-3B-Cold-Start](https://huggingface.co/LMMs-Lab-Turtle/Qwen-2.5VL-3B-Cold-Start)
* **Datasets:**
  * [Vision-SR1-Cold-Start-9K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-Cold-9K)
  * [Vision-SR1-47K](https://huggingface.co/datasets/LMMs-Lab-Turtle/Vision-SR1-47K)
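
### Example Usage

Vision-SR1-7B uses the `Qwen2_5_VLForConditionalGeneration` architecture, so it can be loaded through the standard `transformers` image-text-to-text workflow. The snippet below is a minimal inference sketch following the usual Qwen2.5-VL recipe; the repo id points to the Vision-SR1-7B checkpoint linked above, and the image URL and question are placeholders.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "LMMs-Lab-Turtle/SelfRewarded-R1-7B"  # Vision-SR1-7B checkpoint linked above
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and question.
image = Image.open(requests.get("https://example.com/demo.jpg", stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is happening in this image?"},
        ],
    }
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens before decoding the response.
response_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(response_ids, skip_special_tokens=True)[0])
```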

## Installation and Training

The codebase for Vision-SR1 is adapted from [verl](https://github.com/volcengine/verl) and [EasyR1](https://github.com/hiyouga/EasyR1), and requires `transformers==4.49.0`.

For environment setup and detailed training instructions, please refer to the [official GitHub repository](https://github.com/zli12321/Vision-SR1).

### Software Requirements

* Python 3.9+
* transformers==4.49.0
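
To confirm the pinned dependency before training, an optional sanity check (not part of the upstream setup scripts):

```python
import transformers

# Vision-SR1 requires transformers 4.49.0 (see the requirement above).
assert transformers.__version__ == "4.49.0", f"found transformers {transformers.__version__}"
```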

### RL Training Setup

```bash
git clone https://github.com/zli12321/Vision-SR1.git
cd Vision-SR1
conda create -n Vision-SR1 python=3.11
conda activate Vision-SR1  # activate the new environment before running setup (assumed step)
bash setup.sh
```

### GRPO Training

```bash
### Self-Reward Vision-SR1 GRPO Training
bash ./train_examples/2-7b_selfReward_train.sh

### Vision-SR1 regular training
bash ./train_examples/1-7b_visionR1_train.sh
```

### Supervised Finetuning

The supervised finetuning code is adapted from [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).

#### Setup

```bash
conda create -n SFT python=3.11
conda activate SFT  # activate the new environment before installing (assumed step)
cd LLaMA-Factory-Cold-Start
pip install -e ".[torch,metrics]" --no-build-isolation

pip install --upgrade huggingface_hub
huggingface-cli login
```

#### Training

```bash
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/Vision-SR1-Cold-Start.yaml
```

## Citation

If you find our work helpful, please cite the paper:

**Self-Rewarding Vision-Language Model via Reasoning Decomposition**
[https://huggingface.co/papers/2508.19652](https://huggingface.co/papers/2508.19652)

*(Note: the BibTeX entry provided for this paper in the original GitHub README (`luo2024semievol`) appears to belong to a different paper. Please use the correct BibTeX for "Self-Rewarding Vision-Language Model via Reasoning Decomposition" in your local citation.)*

We also recommend citing `EasyR1`, from which parts of this project were adapted:

```bibtex
@misc{zheng2025easyr1,
  title        = {EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework},
  author       = {Yaowei Zheng and Junting Lu and Shenzhi Wang and Zhangchi Feng and Dongdong Kuang and Yuwen Xiong},
  howpublished = {\url{https://github.com/hiyouga/EasyR1}},
  year         = {2025}
}
```