Improve model card: Add pipeline tag, library name, and comprehensive details from GitHub
This PR significantly enhances the model card for the `Reason-RFT` project by:
1. **Adding Metadata**:
* `pipeline_tag: image-text-to-text` is added, as the model is a Visual Language Model (VLM) designed for visual reasoning, taking images and text as input to generate text. This improves discoverability on the Hugging Face Hub.
* `library_name: transformers` is added, as evidenced by the model's architecture (`Qwen2VLForConditionalGeneration`) and components (`Qwen2Tokenizer`, `Qwen2VLProcessor`) found in the `config.json` and `tokenizer_config.json` files. This will enable automated code snippets for easy usage.
2. **Updating Content for Clarity and Completeness**:
* The main title has been updated to `# Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models` to align with the paper and official GitHub repository title.
* The "News" and "Citation" sections have been updated with the latest and most comprehensive information available from the project's GitHub README, including recent announcements and additional relevant citations.
* Detailed "RoadMap", "Pipeline", "General Visual Reasoning Tasks" (including Setup, Dataset Preparation, Training, and Evaluation instructions), and "Embodied Visual Reasoning Tasks" sections have been integrated from the GitHub README. These provide extensive usage guidance and project context, replacing the generic "Usage" link.
* Malformed HTML in the header links (`<p align="center"> ... </p>`) has been corrected for better rendering and validity.
These changes provide a more informative, up-to-date, and user-friendly model card for the community.
---
base_model:
- Qwen/Qwen2-VL-2B-Instruct
datasets:
- tanhuajie2001/Reason-RFT-CoT-Dataset
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
library_name: transformers
---

<div align="center">
<img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/logo.png" width="500"/>
</div>

# Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

<p align="center">
βοΈ <a href="https://tanhuajie.github.io/ReasonRFT/">Project</a> β π <a href="https://github.com/tanhuajie/Reason-RFT">Github</a> β π₯ <a href="https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset">Dataset</a> β π <a href="https://arxiv.org/abs/2503.20752">Paper</a> β π¬ <a href="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/wechat.png">WeChat</a>
</p>

<p align="center">
π€ <a href="https://github.com/FlagOpen/RoboBrain/">RoboBrain</a>: Aim to Explore ReasonRFT Paradigm to Enhance RoboBrain's Embodied Reasoning Capabilities.
</p>

## π₯ Overview

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI).
Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities.

<div align="center">
<img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/overview.png" />
</div>

## <a id="RoadMap"> π― RoadMap</a>

- **`Support different VLMs`**: [RoboBrain](https://github.com/FlagOpen/RoboBrain/), [Qwen2-VL series](https://github.com/QwenLM/Qwen2.5-VL/), [Llava-VL series](https://github.com/LLaVA-VL/LLaVA-NeXT).
  - Explore an efficient training paradigm to enhance [RoboBrain](https://github.com/FlagOpen/RoboBrain/)'s embodied reasoning capabilities.
- **`Support General Visual Reasoning Tasks`**:
  - Data download and preparation: Please refer to [General Visual Reasoning Tasks](#Preparation).
  - Training and evaluating for **Visual Counting**: Please refer to the [Visual Counting Section](#GeneralVisualTasks).
  - Training and evaluating for **Structure Perception**: Please refer to the [Structure Perception Section](#GeneralVisualTasks).
  - Training and evaluating for **Spatial Transformation**: Please refer to the [Spatial Transformation Section](#GeneralVisualTasks).
- **`Support Embodied Visual Reasoning Tasks`**:
  - Data generation and preparation: Please refer to [Embodied Visual Reasoning Tasks](#EmbodiedVisualReasoningTasks).
  - Training and evaluating for **Embodied Planning**: Please refer to the [Embodied Planning Section](#EmbodiedVisualReasoningTasks).
  - Training and evaluating for **Embodied Affordance**: Please refer to the [Embodied Affordance Section](#EmbodiedVisualReasoningTasks).
  - Training and evaluating for **Embodied Trajectory**: Please refer to the [Embodied Trajectory Section](#EmbodiedVisualReasoningTasks).
  - Training and evaluating for **Embodied Pointing**: Please refer to the [Embodied Pointing Section](#EmbodiedVisualReasoningTasks).

## ποΈ News

- **`2025-09-18`**: π₯π₯π₯ **Reason-RFT** has been accepted to NeurIPS 2025! See you in Mexico City and San Diego, USA!
- **`2025-06-06`**: π€ We're excited to announce the release of our more powerful [RoboBrain 2.0](https://github.com/FlagOpen/RoboBrain2.0), trained using Reason-RFT.
- **`2025-04-13`**: β¨ We released our [model zoo](https://github.com/tanhuajie/Reason-RFT?tab=readme-ov-file#--model-zoo) on Hugging Face.
- **`2025-04-04`**: π€ We released our [datasets](https://huggingface.co/datasets/tanhuajie2001/Reason-RFT-CoT-Dataset/) on Hugging Face for [General Visual Reasoning Tasks](#GeneralVisualTasks).
- **`2025-04-02`**: π₯ We released code and scripts for training/evaluation on [General Visual Reasoning Tasks](#GeneralVisualTasks).
- **`2025-03-29`**: π We released the [repository](https://github.com/tanhuajie/Reason-RFT/) and [roadmap](#RoadMap) for **Reason-RFT**.
- **`2025-03-26`**: π We released our initial [arXiv paper](https://arxiv.org/abs/2503.20752/) of **Reason-RFT**.

## <a id="Method">βοΈ Pipeline</a>

<div align="center">
<img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/pipeline.png" />
</div>

## <a id="ModelCheckpoints"> π€ Model Zoo</a>

| Tasks | Reason-RFT-Zero-2B | Reason-RFT-Zero-7B | Reason-RFT-2B | Reason-RFT-7B |
|------------------------|---------------------------|---------------------|---------------------------|---------------------------|
| Visual Counting | [π€VC-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Visual-Counting-Qwen2-VL-2B) | [π€VC-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Visual-Counting-Qwen2-VL-7B) | [π€VC-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-2B) | [π€VC-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Visual-Counting-Qwen2-VL-7B) |
| Structure Perception | [π€SP-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Structure-Perception-Qwen2-VL-2B) | [π€SP-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Structure-Perception-Qwen2-VL-7B) | [π€SP-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Structure-Perception-Qwen2-VL-2B) | [π€SP-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Structure-Perception-Qwen2-VL-7B) |
| Spatial Transformation | [π€ST-GRPO-Zero-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-2B) | [π€ST-GRPO-Zero-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Zero-Spatial-Transformation-Qwen2-VL-7B) | [π€ST-GRPO-2B](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-2B) | [π€ST-GRPO-7B](https://huggingface.co/tanhuajie2001/Reason-RFT-Spatial-Transformation-Qwen2-VL-7B) |

## <a id="GeneralVisualTasks"> π² General Visual Reasoning Tasks</a>

### π οΈ Setup

```bash
# clone repo
git clone https://github.com/tanhuajie/Reason-RFT.git
cd Reason-RFT

# build conda env for stage_rl
conda create -n reasonrft_rl python=3.10
conda activate reasonrft_rl
pip install -r requirements_rl.txt

# build conda env for stage_sft
conda create -n reasonrft_sft python=3.10
conda activate reasonrft_sft
pip install -r requirements_sft.txt
```

### <a id="Preparation"> β£οΈ Dataset Preparation</a>

#### Step 1: Download Dataset

```bash
# Download from tanhuajie2001/Reason-RFT-CoT-Dataset
huggingface-cli download --repo-type dataset --resume-download tanhuajie2001/Reason-RFT-CoT-Dataset --local-dir ./Reason-RFT-CoT-Dataset

# unzip images
cd Reason-RFT-CoT-Dataset
unzip train_images.zip
unzip test_images.zip
```

Then, your local directory should be like:

```bash
Reason-RFT-CoT-Dataset/
│   # Images for training & evaluation
├── images/
│   ├── train_images/
│   │   └── ...
│   └── test_images/
│       └── ...
│   # CoT datasets for training
├── train_jsons/
│   │   # Full datasets for Spatial-Transformation task
│   ├── A1-Spatial-Transformation-train-60k-cot.json
│   │   # Full datasets for Structure-Perception task
│   ├── A2-Structure-Perception-train-4k5-cot.json
│   │   # Full datasets for Visual-Counting task
│   ├── A3-Visual-Counting-train-35k-cot.json
│   │   # Scientific visual reasoning (Optional for extensive/ablation exp.)
│   ├── AI2D-train-1467-cot.json
│   ├── ScienceQA-train-2112-cot.json
│   │   # Topological visual reasoning (Optional for extensive/ablation exp.)
│   ├── GVLQA-connectivity-train-1199-cot.json
│   ├── GVLQA-cycle-train-1194-cot.json
│   ├── GVLQA-hamilton-train-1158-cot.json
│   ├── GVLQA-topology-train-1070-cot.json
│   ├── GVLQA-matching-train-1193-cot.json
│   │   # Pattern & Puzzle visual reasoning (Optional for extensive/ablation exp.)
│   ├── PuzzleVQA-train-1618-cot.json
│   ├── IconQA-train-5270-cot.json
│   ├── Raven-train-982-cot.json
│   │   # Geometric visual reasoning (Optional for extensive/ablation exp.)
│   ├── GeoQA-train-1500-cot.json
│   ├── GeomVerse-train-2841-cot.json
│   └── Geometry3K-train-3794-cot.json
│   # Datasets for evaluation
├── test_jsons/
│   │   # Evaluation for Spatial-Transformation task
│   ├── Spatial-Transformation-id-test-1k.json           # In-Domain
│   ├── Spatial-Transformation-ood-left-test-1k.json     # Out-of-Domain
│   ├── Spatial-Transformation-ood-right-test-1k.json    # Out-of-Domain
│   │   # Evaluation for Structure-Perception task
│   ├── Structure-Perception-id-test-820.json            # In-Domain
│   ├── Structure-Perception-ood-test-800.json           # Out-of-Domain
│   │   # Evaluation for Visual-Counting task
│   ├── Visual-Counting-id-test-1k.json                  # In-Domain
│   └── Visual-Counting-ood-test-1k.json                 # Out-of-Domain
└── README.md
```

#### Step 2: Construct Datasets for ANS-SFT, COT-SFT, and Reason-RFT(-Zero)

*Note:* To train on our three main tasks, only three meta-JSON files are needed: **'A1-Spatial-Transformation-train-60k-cot.json'**, **'A2-Structure-Perception-train-4k5-cot.json'**, and **'A3-Visual-Counting-train-35k-cot.json'**. All training JSON files can be constructed from these three meta-JSON files. At this step, write a simple script to construct **your own training JSON files** directly, following the sample formats shown below.

**π 1. For *ANS-SFT* training, we use the ShareGPT format to refactor each sample:**

```json
{
    "id": "{id}",
    "image": "{image}",
    "messages": [
        {
            "content": "{PROMPT_xxx_ANS_SFT} + <image> + {problem}",
            "role": "user"
        },
        {
            "content": "{answer}",
            "role": "assistant"
        }
    ]
},
```
```
Tips: {PROMPT_xxx_ANS_SFT} can be found in ./utils/prompts.py, while {id}, {image}, {problem} and {answer} come from the meta-JSON files.
```
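As a concrete illustration, the refactoring above takes only a few lines. The function below is a minimal sketch, not the project's actual conversion script; the meta-JSON field names and the prompt string in the usage example are assumptions for illustration (the real prompts live in `./utils/prompts.py`):

```python
def to_ans_sft(sample: dict, prompt: str) -> dict:
    """Refactor one meta-JSON record into a ShareGPT-style ANS-SFT sample."""
    return {
        "id": sample["id"],
        "image": sample["image"],
        "messages": [
            # user turn: task prompt + image placeholder + question
            {"content": f"{prompt} <image> {sample['problem']}", "role": "user"},
            # assistant turn: the final answer only (no reasoning trace)
            {"content": sample["answer"], "role": "assistant"},
        ],
    }

# hypothetical meta-JSON record, for illustration only
meta = {"id": "vc-0001", "image": "train_images/vc_0001.png",
        "problem": "How many cubes are in the image?", "answer": "7"}
record = to_ans_sft(meta, "{PROMPT_xxx_ANS_SFT}")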

**π 2. For *COT-SFT* training, we also use the ShareGPT format to refactor each sample:**

```json
{
    "id": "{id}",
    "image": "{image}",
    "messages": [
        {
            "content": "{PROMPT_xxx_COT_SFT} + <image> + {problem}",
            "role": "user"
        },
        {
            "content": "<think>
{cot}
</think>
<answer>
{answer}
</answer>",
            "role": "assistant"
        }
    ]
},
```
```
Tips: {PROMPT_xxx_COT_SFT} can be found in ./utils/prompts.py, while {id}, {image}, {problem}, {cot} and {answer} come from the meta-JSON files.

Note: when refactoring the 'Structure Perception' task, {PROMPT_xxx_COT_SFT} is divided into 'PROMPT_STRUCTURE_PERCEPTION_CHOICE_COT_SFT' and 'PROMPT_STRUCTURE_PERCEPTION_NON_CHOICE_COT_SFT'. If 'A', 'B', 'C' or 'D' appears in {answer}, use 'PROMPT_STRUCTURE_PERCEPTION_CHOICE_COT_SFT'; otherwise, use 'PROMPT_STRUCTURE_PERCEPTION_NON_CHOICE_COT_SFT'.
```
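The choice/non-choice rule above is easy to get subtly wrong, so here is a minimal sketch of the selection logic together with the `<think>/<answer>` target construction. The prompt constants are placeholders, not the real strings from `./utils/prompts.py`:

```python
# Placeholder prompt constants; the real strings live in ./utils/prompts.py
PROMPT_STRUCTURE_PERCEPTION_CHOICE_COT_SFT = "<choice prompt>"
PROMPT_STRUCTURE_PERCEPTION_NON_CHOICE_COT_SFT = "<non-choice prompt>"

def pick_sp_prompt(answer: str) -> str:
    """Select the COT-SFT prompt for a Structure-Perception sample."""
    if any(opt in answer for opt in ("A", "B", "C", "D")):
        return PROMPT_STRUCTURE_PERCEPTION_CHOICE_COT_SFT
    return PROMPT_STRUCTURE_PERCEPTION_NON_CHOICE_COT_SFT

def cot_target(cot: str, answer: str) -> str:
    """Assistant message in the <think>/<answer> format used for COT-SFT."""
    return f"<think>\n{cot}\n</think>\n<answer>\n{answer}\n</answer>"
```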

**π 3. For *Reason-RFT-Zero* training, we use the RL format below to refactor each sample:**

```json
{
    "id": "{id}",
    "image": "{image}",
    "problem": "{problem}",
    "solution": "{answer}"
},
```
```
Tips: {id}, {image}, {problem} and {answer} come from the meta-JSON files.
```

**π 4. For *Reason-RFT* training, we use the COT-SFT format to refactor 1.6k samples for STAGE-1, and the RL format to refactor the remaining samples (or the full set) for STAGE-2.**

*Specifically, in STAGE 2, we refactor the full set of samples for the Structure-Perception task because its training set is limited (only 4.5k samples), while we refactor only the remaining samples for the Spatial-Transformation and Visual-Counting tasks.*
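The two-stage split can be sketched as below. Only the 1.6k STAGE-1 figure and the full-set rule for Structure-Perception come from the text; the function name, the shuffling, and the seed handling are illustrative assumptions:

```python
import random

def split_for_reason_rft(samples: list, stage1_size: int = 1600,
                         reuse_all_for_rl: bool = False, seed: int = 0):
    """Split meta samples into a STAGE-1 COT-SFT subset and a STAGE-2 RL set.

    reuse_all_for_rl=True mirrors the Structure-Perception setting, where the
    full set is reused for RL because only ~4.5k samples exist.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    stage1 = shuffled[:stage1_size]
    stage2 = shuffled if reuse_all_for_rl else shuffled[stage1_size:]
    return stage1, stage2
```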

#### Step 3: Change Paths to Your Own Constructed Datasets

```bash
# SFT Training:
change dataset paths defined in './train/stage_sft/dataset_info.json' file.

# RL Training:
change dataset paths defined in './scripts/train/reason_rft/stage_rl/xxx.bash' file.
change dataset paths defined in './scripts/train/reason_rft_zero/xxx.bash' file.

# Evaluation:
change dataset paths defined in './eval/eval_by_vllm_for_open_source.py' file.
```

### <a id="Training"> π Training</a>

```bash
# ANS-SFT, Task1 (Visual-Counting), Qwen2-vl-2b
bash scripts/train/ans_sft/resume_finetune_qwen2vl_2b_task1_ans_sft.sh

# ANS-SFT, Task1 (Visual-Counting), Qwen2-vl-7b
bash scripts/train/ans_sft/resume_finetune_qwen2vl_7b_task1_ans_sft.sh

# ANS-SFT, Task2 (Structure-Perception), Qwen2-vl-2b
bash scripts/train/ans_sft/resume_finetune_qwen2vl_2b_task2_ans_sft.sh

# ANS-SFT, Task2 (Structure-Perception), Qwen2-vl-7b
bash scripts/train/ans_sft/resume_finetune_qwen2vl_7b_task2_ans_sft.sh

# ANS-SFT, Task3 (Spatial-Transformation), Qwen2-vl-2b
bash scripts/train/ans_sft/resume_finetune_qwen2vl_2b_task3_ans_sft.sh

# ANS-SFT, Task3 (Spatial-Transformation), Qwen2-vl-7b
bash scripts/train/ans_sft/resume_finetune_qwen2vl_7b_task3_ans_sft.sh
```

```bash
# COT-SFT, Task1 (Visual-Counting), Qwen2-vl-2b
bash scripts/train/cot_sft/resume_finetune_qwen2vl_2b_task1_cot_sft.sh

# COT-SFT, Task1 (Visual-Counting), Qwen2-vl-7b
bash scripts/train/cot_sft/resume_finetune_qwen2vl_7b_task1_cot_sft.sh

# COT-SFT, Task2 (Structure-Perception), Qwen2-vl-2b
bash scripts/train/cot_sft/resume_finetune_qwen2vl_2b_task2_cot_sft.sh

# COT-SFT, Task2 (Structure-Perception), Qwen2-vl-7b
bash scripts/train/cot_sft/resume_finetune_qwen2vl_7b_task2_cot_sft.sh

# COT-SFT, Task3 (Spatial-Transformation), Qwen2-vl-2b
bash scripts/train/cot_sft/resume_finetune_qwen2vl_2b_task3_cot_sft.sh

# COT-SFT, Task3 (Spatial-Transformation), Qwen2-vl-7b
bash scripts/train/cot_sft/resume_finetune_qwen2vl_7b_task3_cot_sft.sh
```

```bash
# Reason-RFT-Zero, Task1 (Visual-Counting), Qwen2-vl-2b
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_2b_task1_only_rl.sh

# Reason-RFT-Zero, Task1 (Visual-Counting), Qwen2-vl-7b
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_7b_task1_only_rl.sh

# Reason-RFT-Zero, Task2 (Structure-Perception), Qwen2-vl-2b
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_2b_task2_only_rl.sh

# Reason-RFT-Zero, Task2 (Structure-Perception), Qwen2-vl-7b
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_7b_task2_only_rl.sh

# Reason-RFT-Zero, Task3 (Spatial-Transformation), Qwen2-vl-2b
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_2b_task3_only_rl.sh

# Reason-RFT-Zero, Task3 (Spatial-Transformation), Qwen2-vl-7b
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_7b_task3_only_rl.sh
```

```bash
# Reason-RFT, Task1 (Visual-Counting), Qwen2-vl-2b, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_2b_task1_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_2b_task1_stage2_rl.sh

# Reason-RFT, Task1 (Visual-Counting), Qwen2-vl-7b, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_7b_task1_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_7b_task1_stage2_rl.sh

# Reason-RFT, Task2 (Structure-Perception), Qwen2-vl-2b, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_2b_task2_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_2b_task2_stage2_rl.sh

# Reason-RFT, Task2 (Structure-Perception), Qwen2-vl-7b, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_7b_task2_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_7b_task2_stage2_rl.sh

# Reason-RFT, Task3 (Spatial-Transformation), Qwen2-vl-2b, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_2b_task3_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_2b_task3_stage2_rl.sh

# Reason-RFT, Task3 (Spatial-Transformation), Qwen2-vl-7b, STAGE1 + STAGE2
bash scripts/train/reason_rft/stage_sft/resume_finetune_qwen2vl_7b_task3_stage1_sft.sh
bash scripts/train/reason_rft/stage_rl/resume_finetune_qwen2vl_7b_task3_stage2_rl.sh
```
**Note:** Please change the dataset, pre-trained model and image paths in the scripts above.

## <a id="Evaluation"> π Evaluation</a>

```bash
# Evaluate and get scores for proprietary models (GPT-4o, Gemini)
bash scripts/eval/close_source_models/evaluate_gemini.sh
bash scripts/eval/close_source_models/evaluate_gpt4o.sh

# Only get scores for proprietary models (GPT-4o, Gemini), if you have already evaluated them before
bash scripts/eval/close_source_models/evaluate_gemini_only_calculate_score.sh
bash scripts/eval/close_source_models/evaluate_gpt4o_only_calculate_score.sh

# Evaluate with a single GPU for our models or open-source models
bash scripts/eval/open_source_models/single_gpu_eval/eval_by_vllm_all_tasks_zero_shot_single_gpu.sh
bash scripts/eval/open_source_models/single_gpu_eval/eval_by_vllm_all_tasks_ans_sft_single_gpu.sh
bash scripts/eval/open_source_models/single_gpu_eval/eval_by_vllm_all_tasks_cot_sft_single_gpu.sh
bash scripts/eval/open_source_models/single_gpu_eval/eval_by_vllm_all_tasks_reason_rft_single_gpu.sh

# Evaluate with multiple GPUs for our models or open-source models
bash scripts/eval/open_source_models/multi_gpu_eval/eval_by_vllm_all_tasks_zero_shot_multi_gpu.sh
bash scripts/eval/open_source_models/multi_gpu_eval/eval_by_vllm_all_tasks_ans_sft_multi_gpu.sh
bash scripts/eval/open_source_models/multi_gpu_eval/eval_by_vllm_all_tasks_cot_sft_multi_gpu.sh
bash scripts/eval/open_source_models/multi_gpu_eval/eval_by_vllm_all_tasks_reason_rft_multi_gpu.sh

# Get scores for our models or open-source models
bash scripts/eval/open_source_models/calculate_score/calculate_score.sh
```
**Note:** Please change the checkpoint paths in the scripts above.
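Under the hood, accuracy scoring for these tasks reduces to pulling the final answer out of the model's `<think>/<answer>` output and comparing it with the ground-truth solution. The sketch below illustrates the idea only; the actual logic lives in the evaluation scripts above, and the regex and exact-match normalization here are assumptions:

```python
import re

def extract_answer(output: str) -> str:
    """Pull the content of the last <answer>...</answer> span, if any."""
    matches = re.findall(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    return matches[-1].strip() if matches else output.strip()

def accuracy(outputs: list, solutions: list) -> float:
    """Exact-match accuracy after answer extraction."""
    hits = sum(extract_answer(o) == s.strip() for o, s in zip(outputs, solutions))
    return hits / len(solutions)
```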

### <a id="OurResults"> βοΈ Results in our paper</a>

*1. Results of the Spatial Transformation task:*

<div align="center">
<img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/res_task3.png" />
</div>

*2. Full results of the three tasks:*

<div align="center">
<img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/result.png" />
</div>

*Note: For more detailed results, please refer to our [main paper and appendix](https://arxiv.org/abs/2503.20752).*

## <a id="EmbodiedVisualReasoningTasks"> π€ Embodied Visual Reasoning Tasks</a>

We apply Reason-RFT to train the more powerful RoboBrain 2.0. Please refer to the [RoboBrain 2.0 GitHub](https://github.com/FlagOpen/RoboBrain2.0) for more details. Here is a simplified comparison result (not the final version):

<div align="center">
<img src="https://github.com/tanhuajie/Reason-RFT/raw/main/assets/rb2_res.png" />
</div>

## π Citation

If you find this project useful, you are welcome to cite us.

```bibtex
  journal={arXiv preprint arXiv:2503.20752},
  year={2025}
}

@article{team2025robobrain,
  title={Robobrain 2.0 technical report},
  author={Team, BAAI RoboBrain and Cao, Mingyu and Tan, Huajie and Ji, Yuheng and Lin, Minglan and Li, Zhiyu and Cao, Zhou and Wang, Pengwei and Zhou, Enshen and Han, Yi and others},
  journal={arXiv preprint arXiv:2507.02029},
  year={2025}
}

@article{ji2025robobrain,
  title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
  author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
  journal={arXiv preprint arXiv:2502.21257},
  year={2025}
}
```